# Ollama Context Proxy
A smart proxy server for Ollama that provides **automatic context size detection** and **URL-based context routing**. This proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.
## Why Ollama Context Proxy?
### The Problem
- **Memory Efficiency**: Large context windows consume significantly more GPU memory and processing time
- **Manual Configuration**: Traditional setups require you to manually set context sizes for each request
- **One-Size-Fits-All**: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
- **Performance Impact**: Using a 32K context for a simple 100-token prompt is inefficient
### The Solution
Ollama Context Proxy solves these issues by:
1. **🧠 Intelligent Auto-Sizing**: Automatically analyzes prompt content and selects the optimal context size
2. **🎯 Resource Optimization**: Uses smaller contexts for small prompts, larger contexts only when needed
3. **⚡ Performance Boost**: Reduces memory usage and inference time for most requests
4. **🔧 Flexible Routing**: URL-based routing allows explicit context control when needed
5. **🔄 Drop-in Replacement**: Works as a transparent proxy - no client code changes required
## Features
- **Automatic Context Detection**: Analyzes prompts and automatically selects appropriate context sizes
- **URL-Based Routing**: Explicit context control via URL paths (`/proxy-context/4096/api/generate`)
- **Multiple API Support**: Works with Ollama native API and OpenAI-compatible endpoints
- **Streaming Support**: Full support for streaming responses
- **Resource Optimization**: Reduces memory usage by using appropriate context sizes
- **Docker Ready**: Includes Docker configuration for easy deployment
- **Environment Variable Support**: Configurable via `OLLAMA_BASE_URL`
## Quick Start
### Using Docker (Recommended)
```bash
# Build the Docker image
docker build -t ollama-context-proxy .
# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy
# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy
```
### Direct Python Usage
```bash
# Install dependencies
pip install -r requirements.txt
# Run with auto-detection of Ollama
python3 ollama-context-proxy.py
# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434
```
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Full URL to Ollama server (Docker default) |
### Command Line Arguments
```bash
python3 ollama-context-proxy.py [OPTIONS]

Options:
  --ollama-host HOST   Ollama server host (default: localhost or from OLLAMA_BASE_URL)
  --ollama-port PORT   Ollama server port (default: 11434)
  --proxy-port PORT    Proxy server port (default: 11435)
  --log-level LEVEL    Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
```
## Usage Examples
### Automatic Context Sizing (Recommended)
The proxy automatically determines the best context size based on your prompt:
```bash
# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Write a short story about a robot.",
    "stream": false
  }'

# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Fixed Context Sizes
When you need explicit control over context size:
```bash
# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Hello world"}'

# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Your very long prompt here..."}'
```
### OpenAI-Compatible Endpoints
```bash
# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 150
  }'
```
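Because the proxy exposes the `/v1` routes shown above, OpenAI-compatible client libraries can also point at it. Below is a minimal sketch using the `openai` Python package (v1+), which is an extra dependency not required by the proxy itself; the API key is a placeholder, since Ollama does not normally validate it:

```python
# Minimal sketch: point the OpenAI SDK (v1+) at the proxy's auto-sizing route.
# Assumes `pip install openai`; the api_key is a placeholder value.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/proxy-context/auto/v1",
    api_key="ollama",  # placeholder; not validated by Ollama
)

response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=150,
)
print(response.choices[0].message.content)
```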
### Health Check
```bash
# Check proxy status and available context sizes
curl http://localhost:11435/health
```
## How Auto-Sizing Works
The proxy uses intelligent analysis to determine optimal context sizes:
1. **Content Analysis**: Extracts and analyzes prompt text from various endpoint formats
2. **Token Estimation**: Estimates input tokens using character-based approximation
3. **Buffer Calculation**: Adds buffers for system prompts, response space, and safety margins
4. **Context Selection**: Chooses the smallest available context that can handle the request
### Available Context Sizes
- **2K** (2048 tokens): Short prompts, simple Q&A
- **4K** (4096 tokens): Medium prompts, code snippets
- **8K** (8192 tokens): Long prompts, detailed instructions
- **16K** (16384 tokens): Very long prompts, document analysis
- **32K** (32768 tokens): Maximum context, large documents
### Auto-Sizing Logic
```
Total Required = Input Tokens + Max Response Tokens + System Overhead + Safety Margin

  Input Tokens        - estimated from prompt content
  Max Response Tokens - taken from the request's max_tokens
  System Overhead     - 100-token buffer for system prompts
  Safety Margin       - 200-token buffer
```
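The sketch below illustrates this selection rule in Python. It is not the proxy's actual code: the function names, the 512-token default response budget, and the 4-characters-per-token heuristic are assumptions chosen to match the formula above.

```python
# Illustrative sketch of the auto-sizing rule described above; names and the
# 4-characters-per-token heuristic are assumptions, not the proxy's exact code.
AVAILABLE_CONTEXTS = [2048, 4096, 8192, 16384, 32768]

def estimate_tokens(text: str) -> int:
    """Rough character-based estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

def select_context(prompt: str, max_response_tokens: int = 512) -> int:
    """Pick the smallest available context that fits the whole request."""
    system_overhead = 100   # buffer for system prompts
    safety_margin = 200     # extra headroom
    required = estimate_tokens(prompt) + max_response_tokens + system_overhead + safety_margin
    for size in AVAILABLE_CONTEXTS:
        if required <= size:
            return size
    return AVAILABLE_CONTEXTS[-1]  # request may exceed the largest context

print(select_context("Write a short story about a robot."))  # -> 2048
```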
## Docker Compose Integration
Example `docker-compose.yml` integration:
```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```
## API Endpoints
### Proxy Endpoints
| Endpoint Pattern | Description |
|-----------------|-------------|
| `/proxy-context/auto/{path}` | Auto-detect context size |
| `/proxy-context/{size}/{path}` | Fixed context size (2048, 4096, 8192, 16384, 32768) |
| `/health` | Health check and proxy status |
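For programmatic callers, a proxied URL is just the context selector spliced between the proxy base and the upstream path. The helper below is hypothetical, shown only to make the pattern concrete:

```python
# Hypothetical helper (not shipped with the proxy): build a proxied URL from a
# context selector ("auto" or a fixed size) and an upstream Ollama path.
def proxied_url(path, context="auto", base="http://localhost:11435"):
    return f"{base}/proxy-context/{context}/{path.lstrip('/')}"

print(proxied_url("/api/generate"))    # http://localhost:11435/proxy-context/auto/api/generate
print(proxied_url("/api/chat", 4096))  # http://localhost:11435/proxy-context/4096/api/chat
```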
### Supported Ollama Endpoints
All standard Ollama endpoints are supported through the proxy:
- `/api/generate` - Text generation
- `/api/chat` - Chat completions
- `/api/tags` - List models
- `/api/show` - Model information
- `/v1/chat/completions` - OpenAI-compatible chat
- `/v1/completions` - OpenAI-compatible completions
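Any HTTP client can call these endpoints through the proxy. The sketch below (assuming the `requests` package, which is not one of the proxy's own dependencies) streams a chat response via the auto-sizing route; Ollama's native API emits newline-delimited JSON chunks:

```python
# Minimal sketch: stream a chat response through the proxy's auto-sizing route
# using Ollama's native /api/chat endpoint. Assumes `pip install requests`.
import json
import requests

with requests.post(
    "http://localhost:11435/proxy-context/auto/api/chat",
    json={
        "model": "llama2",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)  # one JSON object per line
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
```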
## Performance Benefits
### Memory Usage Reduction
Using appropriately sized contexts can significantly reduce GPU memory usage; the figures below are rough estimates and vary with model size and quantization:
- **2K context**: ~1-2GB GPU memory
- **4K context**: ~2-4GB GPU memory
- **8K context**: ~4-8GB GPU memory
- **16K context**: ~8-16GB GPU memory
- **32K context**: ~16-32GB GPU memory
### Response Time Improvement
Smaller contexts process faster:
- **Simple prompts**: 2-3x faster with auto-sizing vs. fixed 32K
- **Medium prompts**: 1.5-2x faster with optimal sizing
- **Large prompts**: Minimal difference (uses large context anyway)
## Monitoring and Logging
The proxy provides detailed logging for monitoring:
```bash
# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG
```
Log information includes:
- Context size selection reasoning
- Token estimation details
- Request routing information
- Performance metrics
## Troubleshooting
### Common Issues
**Connection Refused**
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Verify proxy configuration
curl http://localhost:11435/health
```
**Context Size Warnings**
```
Request may exceed largest available context!
```
- The request requires more than 32K tokens
- Consider breaking large prompts into smaller chunks
- Use streaming for very long responses
**Auto-sizing Not Working**
- Ensure you're using `/proxy-context/auto/` in your URLs
- Check request format matches supported endpoints
- Enable DEBUG logging to see analysis details
### Debug Mode
```bash
# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG
# This will show:
# - Token estimation details
# - Context selection reasoning
# - Request/response routing info
```
## Development
### Requirements
```bash
pip install aiohttp  # asyncio is part of the Python 3 standard library
```
### Project Structure
```
ollama-context-proxy/
├── ollama-context-proxy.py   # Main proxy server
├── requirements.txt          # Python dependencies
├── Dockerfile                # Docker configuration
└── README.md                 # This file
```
### Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
[Add your license information here]
## Support
- **Issues**: Report bugs and feature requests via GitHub issues
- **Documentation**: This README and inline code comments
- **Community**: [Add community links if applicable]
---
**Note**: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from `http://localhost:11434` to `http://localhost:11435/proxy-context/auto` to enable intelligent context sizing.