Ollama Context Proxy
A smart proxy server for Ollama that provides automatic context size detection and URL-based context routing. This proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.
Why Ollama Context Proxy?
The Problem
- Memory Efficiency: Large context windows consume significantly more GPU memory and processing time
- Manual Configuration: Traditional setups require you to manually set context sizes for each request
- One-Size-Fits-All: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
- Performance Impact: Using a 32K context for a simple 100-token prompt is inefficient
The Solution
Ollama Context Proxy solves these issues by:
- 🧠 Intelligent Auto-Sizing: Automatically analyzes prompt content and selects the optimal context size
- 🎯 Resource Optimization: Uses smaller contexts for small prompts, larger contexts only when needed
- ⚡ Performance Boost: Reduces memory usage and inference time for most requests
- 🔧 Flexible Routing: URL-based routing allows explicit context control when needed
- 🔄 Drop-in Replacement: Works as a transparent proxy - no client code changes required
Features
- Automatic Context Detection: Analyzes prompts and automatically selects appropriate context sizes
- URL-Based Routing: Explicit context control via URL paths (/proxy-context/4096/api/generate)
- Multiple API Support: Works with the Ollama native API and OpenAI-compatible endpoints
- Streaming Support: Full support for streaming responses
- Resource Optimization: Reduces memory usage by using appropriate context sizes
- Docker Ready: Includes Docker configuration for easy deployment
- Environment Variable Support: Configurable via OLLAMA_BASE_URL
Quick Start
Using Docker (Recommended)
# Build the Docker image
docker build -t ollama-context-proxy .
# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy
# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy
Direct Python Usage
# Install dependencies
pip install -r requirements.txt
# Run with auto-detection of Ollama
python3 ollama-context-proxy.py
# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434
Configuration
Environment Variables
Variable | Default | Description
---|---|---
OLLAMA_BASE_URL | http://ollama:11434 | Full URL to the Ollama server (Docker default)
Command Line Arguments
python3 ollama-context-proxy.py [OPTIONS]
Options:
--ollama-host HOST Ollama server host (default: localhost or from OLLAMA_BASE_URL)
--ollama-port PORT Ollama server port (default: 11434)
--proxy-port PORT Proxy server port (default: 11435)
--log-level LEVEL Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
Usage Examples
Automatic Context Sizing (Recommended)
The proxy automatically determines the best context size based on your prompt:
# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"prompt": "Write a short story about a robot.",
"stream": false
}'
# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Fixed Context Sizes
When you need explicit control over context size:
# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2", "prompt": "Hello world"}'
# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2", "prompt": "Your very long prompt here..."}'
OpenAI-Compatible Endpoints
# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 150
}'
Health Check
# Check proxy status and available context sizes
curl http://localhost:11435/health
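The same calls can be made from Python. Below is a minimal sketch using the requests library (assumed installed), mirroring the curl examples above and assuming the proxy forwards Ollama's JSON responses unchanged:
import requests

PROXY = "http://localhost:11435/proxy-context/auto"

# Auto-sized generation request, mirroring the curl example above
resp = requests.post(
    f"{PROXY}/api/generate",
    json={
        "model": "llama2",
        "prompt": "Write a short story about a robot.",
        "stream": False,
    },
)
print(resp.json()["response"])

# Check proxy status and available context sizes
print(requests.get("http://localhost:11435/health").json())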
How Auto-Sizing Works
The proxy uses intelligent analysis to determine optimal context sizes:
- Content Analysis: Extracts and analyzes prompt text from various endpoint formats
- Token Estimation: Estimates input tokens using character-based approximation
- Buffer Calculation: Adds buffers for system prompts, response space, and safety margins
- Context Selection: Chooses the smallest available context that can handle the request
Available Context Sizes
- 2K (2048 tokens): Short prompts, simple Q&A
- 4K (4096 tokens): Medium prompts, code snippets
- 8K (8192 tokens): Long prompts, detailed instructions
- 16K (16384 tokens): Very long prompts, document analysis
- 32K (32768 tokens): Maximum context, large documents
Auto-Sizing Logic
Total Required = Input Tokens + Max Response Tokens + System Overhead + Safety Margin
- Input Tokens: estimated from the prompt content
- Max Response Tokens: taken from the request's max_tokens
- System Overhead: 100-token buffer
- Safety Margin: 200-token buffer
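In code, the selection rule above amounts to something like the following sketch (illustrative only; the names, the 4-characters-per-token estimate, and the default response budget are assumptions, not the proxy's actual implementation):
# Illustrative sketch of the auto-sizing rule described above
CONTEXT_SIZES = [2048, 4096, 8192, 16384, 32768]
SYSTEM_OVERHEAD = 100   # buffer for system prompts
SAFETY_MARGIN = 200     # extra headroom

def estimate_tokens(text: str) -> int:
    # Rough character-based approximation (~4 characters per token)
    return max(1, len(text) // 4)

def select_context(prompt: str, max_response_tokens: int = 512) -> int:
    required = estimate_tokens(prompt) + max_response_tokens + SYSTEM_OVERHEAD + SAFETY_MARGIN
    # Choose the smallest available context that can handle the request
    for size in CONTEXT_SIZES:
        if required <= size:
            return size
    # Request may exceed the largest available context
    return CONTEXT_SIZES[-1]

print(select_context("Write a short story about a robot."))  # -> 2048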
Docker Compose Integration
Example docker-compose.yml integration:
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama_data:
API Endpoints
Proxy Endpoints
Endpoint Pattern | Description
---|---
/proxy-context/auto/{path} | Auto-detect context size
/proxy-context/{size}/{path} | Fixed context size (2048, 4096, 8192, 16384, 32768)
/health | Health check and proxy status
Supported Ollama Endpoints
All standard Ollama endpoints are supported through the proxy:
- /api/generate - Text generation
- /api/chat - Chat completions
- /api/tags - List models
- /api/show - Model information
- /v1/chat/completions - OpenAI-compatible chat
- /v1/completions - OpenAI-compatible completions
Performance Benefits
Memory Usage Reduction
Using appropriate context sizes can significantly reduce GPU memory usage; the figures below are rough estimates and vary by model:
- 2K context: ~1-2GB GPU memory
- 4K context: ~2-4GB GPU memory
- 8K context: ~4-8GB GPU memory
- 16K context: ~8-16GB GPU memory
- 32K context: ~16-32GB GPU memory
Response Time Improvement
Smaller contexts process faster:
- Simple prompts: 2-3x faster with auto-sizing vs. fixed 32K
- Medium prompts: 1.5-2x faster with optimal sizing
- Large prompts: Minimal difference (uses large context anyway)
Monitoring and Logging
The proxy provides detailed logging for monitoring:
# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG
Log information includes:
- Context size selection reasoning
- Token estimation details
- Request routing information
- Performance metrics
Troubleshooting
Common Issues
Connection Refused
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Verify proxy configuration
curl http://localhost:11435/health
Context Size Warnings
If you see the warning "Request may exceed largest available context!":
- The request requires more than 32K tokens
- Consider breaking large prompts into smaller chunks (see the sketch below)
- Use streaming for very long responses
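As a rough illustration of the chunking suggestion above (the chunk size, endpoint usage, and summarization prompt are assumptions, not a prescribed workflow):
import requests

PROXY = "http://localhost:11435/proxy-context/auto"

def chunk_text(text: str, max_chars: int = 8000):
    # Roughly 8000 characters ≈ 2000 tokens; adjust to fit your context budget
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_chunks(document: str, model: str = "llama2"):
    summaries = []
    for chunk in chunk_text(document):
        resp = requests.post(
            f"{PROXY}/api/generate",
            json={"model": model, "prompt": f"Summarize:\n{chunk}", "stream": False},
        )
        summaries.append(resp.json().get("response", ""))
    return summaries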
Auto-sizing Not Working
- Ensure you're using /proxy-context/auto/ in your URLs
- Check that the request format matches supported endpoints
- Enable DEBUG logging to see analysis details
Debug Mode
# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG
# This will show:
# - Token estimation details
# - Context selection reasoning
# - Request/response routing info
Development
Requirements
pip install aiohttp
(asyncio is part of the Python standard library and does not need to be installed separately.)
Project Structure
ollama-context-proxy/
├── ollama-context-proxy.py # Main proxy server
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
└── README.md # This file
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
[Add your license information here]
Support
- Issues: Report bugs and feature requests via GitHub issues
- Documentation: This README and inline code comments
- Community: [Add community links if applicable]
Note: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from http://localhost:11434 to http://localhost:11435/proxy-context/auto to enable intelligent context sizing.
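For example, an OpenAI-compatible client only needs its base URL changed (a sketch assuming the openai Python package; the placeholder API key reflects that Ollama's OpenAI-compatible endpoints do not validate it):
from openai import OpenAI

# Point an existing OpenAI-style client at the proxy instead of Ollama directly
client = OpenAI(
    base_url="http://localhost:11435/proxy-context/auto/v1",  # was http://localhost:11434/v1
    api_key="ollama",  # placeholder; not checked by Ollama
)

reply = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)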