# Ollama Context Proxy

A smart proxy server for Ollama that provides **automatic context size detection** and **URL-based context routing**. The proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.

## Why Ollama Context Proxy?

### The Problem

- **Memory Efficiency**: Large context windows consume significantly more GPU memory and processing time
- **Manual Configuration**: Traditional setups require you to manually set context sizes for each request
- **One-Size-Fits-All**: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
- **Performance Impact**: Using a 32K context for a simple 100-token prompt is inefficient

### The Solution

Ollama Context Proxy solves these issues by:

1. **🧠 Intelligent Auto-Sizing**: Automatically analyzes prompt content and selects the optimal context size
2. **🎯 Resource Optimization**: Uses smaller contexts for small prompts, larger contexts only when needed
3. **⚡ Performance Boost**: Reduces memory usage and inference time for most requests
4. **🔧 Flexible Routing**: URL-based routing allows explicit context control when needed
5. **🔄 Drop-in Replacement**: Works as a transparent proxy - no client code changes required

## Features

- **Automatic Context Detection**: Analyzes prompts and automatically selects appropriate context sizes
- **URL-Based Routing**: Explicit context control via URL paths (`/proxy-context/4096/api/generate`)
- **Multiple API Support**: Works with the native Ollama API and OpenAI-compatible endpoints
- **Streaming Support**: Full support for streaming responses
- **Resource Optimization**: Reduces memory usage by using appropriate context sizes
- **Docker Ready**: Includes Docker configuration for easy deployment
- **Environment Variable Support**: Configurable via `OLLAMA_BASE_URL`
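
Conceptually, the proxy sits between your client and Ollama and rewrites each request with the selected context size before forwarding it. The sketch below illustrates the idea for the native `/api/generate` endpoint; it assumes the chosen size is applied through Ollama's `options.num_ctx` field and is illustrative only (the actual behavior is defined by `ollama-context-proxy.py`).

```python
# Illustrative sketch only: how a routed request is conceptually rewritten.
# Assumes the chosen context size is applied via Ollama's "options.num_ctx"
# field; see ollama-context-proxy.py for the actual implementation.

incoming = {
    "model": "llama2",
    "prompt": "Write a short story about a robot.",
}

chosen_context = 2048  # from the URL (/proxy-context/2048/...) or auto-sizing

forwarded = {
    **incoming,
    "options": {**incoming.get("options", {}), "num_ctx": chosen_context},
}
# "forwarded" is what the upstream Ollama server receives at /api/generate
print(forwarded["options"])  # {'num_ctx': 2048}
```
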
## Quick Start

### Using Docker (Recommended)

```bash
# Build the Docker image
docker build -t ollama-context-proxy .

# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy

# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy
```

### Direct Python Usage

```bash
# Install dependencies
pip install -r requirements.txt

# Run with auto-detection of Ollama
python3 ollama-context-proxy.py

# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Full URL to the Ollama server (Docker default) |

### Command Line Arguments

```bash
python3 ollama-context-proxy.py [OPTIONS]

Options:
  --ollama-host HOST   Ollama server host (default: localhost or from OLLAMA_BASE_URL)
  --ollama-port PORT   Ollama server port (default: 11434)
  --proxy-port PORT    Proxy server port (default: 11435)
  --log-level LEVEL    Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
```

## Usage Examples

### Automatic Context Sizing (Recommended)

The proxy automatically determines the best context size based on your prompt:

```bash
# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Write a short story about a robot.",
    "stream": false
  }'

# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Fixed Context Sizes

When you need explicit control over context size:

```bash
# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Hello world"}'

# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Your very long prompt here..."}'
```

### OpenAI-Compatible Endpoints

```bash
# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 150
  }'
```

### Health Check

```bash
# Check proxy status and available context sizes
curl http://localhost:11435/health
```

## How Auto-Sizing Works

The proxy uses intelligent analysis to determine optimal context sizes:

1. **Content Analysis**: Extracts and analyzes prompt text from the various endpoint formats
2. **Token Estimation**: Estimates input tokens using a character-based approximation
3. **Buffer Calculation**: Adds buffers for system prompts, response space, and safety margins
4. **Context Selection**: Chooses the smallest available context that can handle the request (a simplified sketch of this logic follows)
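
The sketch below is a simplified, illustrative version of this selection logic, not the proxy's actual source. It assumes a rough 4-characters-per-token estimate, and the helper names (`estimate_tokens`, `select_context`) and default response budget are made up for the example; the buffer values mirror the Auto-Sizing Logic breakdown below.

```python
# Simplified, illustrative version of the auto-sizing decision.
# Assumes ~4 characters per token; buffer values mirror the
# "Auto-Sizing Logic" section below. Not the actual proxy source.

AVAILABLE_CONTEXTS = [2048, 4096, 8192, 16384, 32768]
SYSTEM_OVERHEAD = 100  # token buffer for system prompts
SAFETY_MARGIN = 200    # extra headroom

def estimate_tokens(text: str) -> int:
    """Rough character-based token estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

def select_context(prompt: str, max_response_tokens: int = 512) -> int:
    """Pick the smallest available context that fits the request."""
    required = (
        estimate_tokens(prompt)
        + max_response_tokens
        + SYSTEM_OVERHEAD
        + SAFETY_MARGIN
    )
    for size in AVAILABLE_CONTEXTS:
        if required <= size:
            return size
    return AVAILABLE_CONTEXTS[-1]  # request may exceed the largest context

# Example: a ~400-character prompt with a 150-token response budget needs
# roughly 100 + 150 + 100 + 200 = 550 tokens, so the 2K context is chosen.
print(select_context("x" * 400, max_response_tokens=150))  # -> 2048
```
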
### Available Context Sizes

- **2K** (2048 tokens): Short prompts, simple Q&A
- **4K** (4096 tokens): Medium prompts, code snippets
- **8K** (8192 tokens): Long prompts, detailed instructions
- **16K** (16384 tokens): Very long prompts, document analysis
- **32K** (32768 tokens): Maximum context, large documents

### Auto-Sizing Logic

```
Total Required = Input Tokens + Max Response Tokens + System Overhead + Safety Margin
```

- **Input Tokens**: estimated from the prompt content
- **Max Response Tokens**: taken from the request's `max_tokens`
- **System Overhead**: 100-token buffer
- **Safety Margin**: 200-token buffer

## Docker Compose Integration

Example `docker-compose.yml` integration:

```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```

## API Endpoints

### Proxy Endpoints

| Endpoint Pattern | Description |
|------------------|-------------|
| `/proxy-context/auto/{path}` | Auto-detect context size |
| `/proxy-context/{size}/{path}` | Fixed context size (2048, 4096, 8192, 16384, 32768) |
| `/health` | Health check and proxy status |

### Supported Ollama Endpoints

All standard Ollama endpoints are supported through the proxy:

- `/api/generate` - Text generation
- `/api/chat` - Chat completions
- `/api/tags` - List models
- `/api/show` - Model information
- `/v1/chat/completions` - OpenAI-compatible chat
- `/v1/completions` - OpenAI-compatible completions

## Performance Benefits

### Memory Usage Reduction

Using appropriate context sizes can significantly reduce GPU memory usage:

- **2K context**: ~1-2GB GPU memory
- **4K context**: ~2-4GB GPU memory
- **8K context**: ~4-8GB GPU memory
- **16K context**: ~8-16GB GPU memory
- **32K context**: ~16-32GB GPU memory

### Response Time Improvement

Smaller contexts process faster:

- **Simple prompts**: 2-3x faster with auto-sizing vs. fixed 32K
- **Medium prompts**: 1.5-2x faster with optimal sizing
- **Large prompts**: Minimal difference (uses large context anyway)

## Monitoring and Logging

The proxy provides detailed logging for monitoring:

```bash
# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG
```

Log information includes:

- Context size selection reasoning
- Token estimation details
- Request routing information
- Performance metrics

## Troubleshooting

### Common Issues

**Connection Refused**

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Verify proxy configuration
curl http://localhost:11435/health
```

**Context Size Warnings**

```
Request may exceed largest available context!
```

- The request requires more than 32K tokens
- Consider breaking large prompts into smaller chunks
- Use streaming for very long responses
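
If you hit this warning, you can estimate on the client side whether a prompt is likely to fit before sending it, using the same character-based approximation described under How Auto-Sizing Works. The snippet below is an illustrative check, not part of the proxy; the buffer values mirror the Auto-Sizing Logic section.

```python
# Illustrative client-side pre-check: will this request likely fit in the
# largest (32K) context? Mirrors the rough estimate described above.

LARGEST_CONTEXT = 32768
SYSTEM_OVERHEAD = 100
SAFETY_MARGIN = 200

def fits_largest_context(prompt: str, max_response_tokens: int = 512) -> bool:
    estimated_input = len(prompt) // 4  # ~4 characters per token
    required = estimated_input + max_response_tokens + SYSTEM_OVERHEAD + SAFETY_MARGIN
    return required <= LARGEST_CONTEXT

long_prompt = "..."  # your document or transcript here
if not fits_largest_context(long_prompt):
    print("Prompt likely exceeds the 32K context; split it into smaller chunks.")
```
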
**Auto-sizing Not Working**

- Ensure you're using `/proxy-context/auto/` in your URLs
- Check that the request format matches the supported endpoints
- Enable DEBUG logging to see analysis details

### Debug Mode

```bash
# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG

# This will show:
# - Token estimation details
# - Context selection reasoning
# - Request/response routing info
```

## Development

### Requirements

```bash
pip install aiohttp
```

(`aiohttp` is the only third-party dependency; `asyncio` is part of the Python standard library and does not need to be installed.)

### Project Structure

```
ollama-context-proxy/
├── ollama-context-proxy.py    # Main proxy server
├── requirements.txt           # Python dependencies
├── Dockerfile                 # Docker configuration
└── README.md                  # This file
```

### Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

[Add your license information here]

## Support

- **Issues**: Report bugs and feature requests via GitHub issues
- **Documentation**: This README and inline code comments
- **Community**: [Add community links if applicable]

---

**Note**: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from `http://localhost:11434` to `http://localhost:11435/proxy-context/auto` to enable intelligent context sizing.
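
As a minimal example, the only client-side change is the base URL. The snippet below is a sketch assuming the proxy is running locally on port 11435 and a `llama2` model is available in Ollama; any HTTP client works the same way.

```python
# Sketch: route an existing Ollama call through the proxy by changing the
# base URL only. Assumes the proxy is on localhost:11435 and llama2 is pulled.
import requests

# Before: base_url = "http://localhost:11434"
base_url = "http://localhost:11435/proxy-context/auto"

response = requests.post(
    f"{base_url}/api/generate",
    json={"model": "llama2", "prompt": "Hello!", "stream": False},
    timeout=120,
)
print(response.json()["response"])
```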