# Ollama Context Proxy

A smart proxy server for Ollama that provides **automatic context size detection** and **URL-based context routing**. This proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.

## Why Ollama Context Proxy?

### The Problem

- **Memory Efficiency**: Large context windows consume significantly more GPU memory and processing time
- **Manual Configuration**: Traditional setups require you to manually set context sizes for each request
- **One-Size-Fits-All**: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
- **Performance Impact**: Using a 32K context for a simple 100-token prompt is inefficient

### The Solution

Ollama Context Proxy solves these issues by:

1. **🧠 Intelligent Auto-Sizing**: Automatically analyzes prompt content and selects the optimal context size
2. **🎯 Resource Optimization**: Uses smaller contexts for small prompts, larger contexts only when needed
3. **⚡ Performance Boost**: Reduces memory usage and inference time for most requests
4. **🔧 Flexible Routing**: URL-based routing allows explicit context control when needed
5. **🔄 Drop-in Replacement**: Works as a transparent proxy - no client code changes required
## Features

- **Automatic Context Detection**: Analyzes prompts and automatically selects appropriate context sizes
- **URL-Based Routing**: Explicit context control via URL paths (`/proxy-context/4096/api/generate`)
- **Multiple API Support**: Works with the Ollama native API and OpenAI-compatible endpoints
- **Streaming Support**: Full support for streaming responses
- **Resource Optimization**: Reduces memory usage by using appropriate context sizes
- **Docker Ready**: Includes Docker configuration for easy deployment
- **Environment Variable Support**: Configurable via `OLLAMA_BASE_URL`
## Quick Start

### Using Docker (Recommended)

```bash
# Build the Docker image
docker build -t ollama-context-proxy .

# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy

# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy
```

### Direct Python Usage

```bash
# Install dependencies
pip install -r requirements.txt

# Run with auto-detection of Ollama
python3 ollama-context-proxy.py

# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434
```
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Full URL to the Ollama server (Docker default) |

### Command Line Arguments

```bash
python3 ollama-context-proxy.py [OPTIONS]

Options:
  --ollama-host HOST   Ollama server host (default: localhost or from OLLAMA_BASE_URL)
  --ollama-port PORT   Ollama server port (default: 11434)
  --proxy-port PORT    Proxy server port (default: 11435)
  --log-level LEVEL    Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
```
## Usage Examples

### Automatic Context Sizing (Recommended)

The proxy automatically determines the best context size based on your prompt:

```bash
# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Write a short story about a robot.",
    "stream": false
  }'

# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Fixed Context Sizes

When you need explicit control over context size:

```bash
# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Hello world"}'

# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Your very long prompt here..."}'
```
### OpenAI-Compatible Endpoints

```bash
# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 150
  }'
```
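The same route works from client libraries that let you override the API base URL. The snippet below is a minimal sketch using the `openai` Python package; it assumes the package is installed (`pip install openai`) and that the client preserves the `/proxy-context/auto` path prefix when building request URLs. Ollama ignores the API key, so any placeholder value works.

```python
# Sketch: OpenAI-compatible chat through the proxy's auto-sizing route.
# Assumes `pip install openai`; the api_key is a placeholder (Ollama ignores it).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/proxy-context/auto/v1",
    api_key="ollama",  # placeholder, not validated by Ollama
)

response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=150,
)
print(response.choices[0].message.content)
```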
### Health Check

```bash
# Check proxy status and available context sizes
curl http://localhost:11435/health
```
## How Auto-Sizing Works

The proxy uses intelligent analysis to determine optimal context sizes:

1. **Content Analysis**: Extracts and analyzes prompt text from various endpoint formats
2. **Token Estimation**: Estimates input tokens using character-based approximation
3. **Buffer Calculation**: Adds buffers for system prompts, response space, and safety margins
4. **Context Selection**: Chooses the smallest available context that can handle the request

### Available Context Sizes

- **2K** (2048 tokens): Short prompts, simple Q&A
- **4K** (4096 tokens): Medium prompts, code snippets
- **8K** (8192 tokens): Long prompts, detailed instructions
- **16K** (16384 tokens): Very long prompts, document analysis
- **32K** (32768 tokens): Maximum context, large documents

### Auto-Sizing Logic

```
Total Required = Input Tokens  +  Max Response Tokens  +  System Overhead  +  Safety Margin
                      ↓                   ↓                      ↓                  ↓
                estimated from       from request            100-token          200-token
                prompt content        max_tokens               buffer             buffer
```
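As an illustration of this heuristic, here is a simplified Python sketch of the selection step. It is not the proxy's actual code: the 4-characters-per-token ratio, the default `max_tokens` value, and the bucket list are taken from the description above and may differ from the implementation.

```python
# Simplified sketch of the auto-sizing heuristic described above.
# Constants mirror the README text, not necessarily the real implementation.
CONTEXT_SIZES = [2048, 4096, 8192, 16384, 32768]
SYSTEM_OVERHEAD = 100  # token buffer for system prompts
SAFETY_MARGIN = 200    # extra headroom

def estimate_tokens(text: str) -> int:
    """Rough character-based estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

def select_context_size(prompt: str, max_tokens: int = 512) -> int:
    """Pick the smallest context bucket that fits the request."""
    required = estimate_tokens(prompt) + max_tokens + SYSTEM_OVERHEAD + SAFETY_MARGIN
    for size in CONTEXT_SIZES:
        if required <= size:
            return size
    # Falls through when the request may exceed the largest available context.
    return CONTEXT_SIZES[-1]

print(select_context_size("Write a short story about a robot."))  # 2048
```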
## Docker Compose Integration

Example `docker-compose.yml` integration:

```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```
## API Endpoints

### Proxy Endpoints

| Endpoint Pattern | Description |
|------------------|-------------|
| `/proxy-context/auto/{path}` | Auto-detect context size |
| `/proxy-context/{size}/{path}` | Fixed context size (2048, 4096, 8192, 16384, 32768) |
| `/health` | Health check and proxy status |
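If you build these URLs in code, a small helper keeps the routing pattern in one place. This is an illustrative sketch only; the `proxy_url` helper and the default base address are not part of the project.

```python
# Sketch: assembling proxy URLs that follow the routing pattern above.
def proxy_url(path: str, context="auto", base="http://localhost:11435") -> str:
    """Build a proxy URL for an Ollama API path and a context choice."""
    return f"{base}/proxy-context/{context}{path}"

print(proxy_url("/api/generate"))     # auto-sizing
print(proxy_url("/api/chat", 8192))   # fixed 8K context
```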
### Supported Ollama Endpoints

All standard Ollama endpoints are supported through the proxy:

- `/api/generate` - Text generation
- `/api/chat` - Chat completions
- `/api/tags` - List models
- `/api/show` - Model information
- `/v1/chat/completions` - OpenAI-compatible chat
- `/v1/completions` - OpenAI-compatible completions
## Performance Benefits

### Memory Usage Reduction

Using appropriate context sizes can significantly reduce GPU memory usage. The figures below are rough estimates and vary by model size and quantization:

- **2K context**: ~1-2GB GPU memory
- **4K context**: ~2-4GB GPU memory
- **8K context**: ~4-8GB GPU memory
- **16K context**: ~8-16GB GPU memory
- **32K context**: ~16-32GB GPU memory

### Response Time Improvement

Smaller contexts process faster:

- **Simple prompts**: 2-3x faster with auto-sizing vs. fixed 32K
- **Medium prompts**: 1.5-2x faster with optimal sizing
- **Large prompts**: Minimal difference (uses a large context anyway)
## Monitoring and Logging

The proxy provides detailed logging for monitoring:

```bash
# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG
```

Log information includes:

- Context size selection reasoning
- Token estimation details
- Request routing information
- Performance metrics
## Troubleshooting

### Common Issues

**Connection Refused**

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Verify proxy configuration
curl http://localhost:11435/health
```

**Context Size Warnings**

```
Request may exceed largest available context!
```

- The request requires more than 32K tokens
- Consider breaking large prompts into smaller chunks
- Use streaming for very long responses

**Auto-sizing Not Working**

- Ensure you're using `/proxy-context/auto/` in your URLs
- Check that the request format matches supported endpoints
- Enable DEBUG logging to see analysis details

### Debug Mode

```bash
# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG

# This will show:
# - Token estimation details
# - Context selection reasoning
# - Request/response routing info
```
## Development

### Requirements

```bash
# aiohttp is the only third-party dependency; asyncio ships with the Python standard library
pip install aiohttp
```
### Project Structure

```
ollama-context-proxy/
├── ollama-context-proxy.py   # Main proxy server
├── requirements.txt          # Python dependencies
├── Dockerfile                # Docker configuration
└── README.md                 # This file
```
### Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License

[Add your license information here]

## Support

- **Issues**: Report bugs and feature requests via GitHub issues
- **Documentation**: This README and inline code comments
- **Community**: [Add community links if applicable]

---

**Note**: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from `http://localhost:11434` to `http://localhost:11435/proxy-context/auto` to enable intelligent context sizing.
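For example, with the official `ollama` Python client this looks roughly like the sketch below. It assumes `pip install ollama` and that the client keeps the `/proxy-context/auto` path prefix when building request URLs; if your client drops URL paths, use the explicit proxy URLs shown in the usage examples above instead.

```python
# Sketch: pointing the ollama Python client at the proxy's auto-sizing route.
# Assumes the path prefix in `host` is preserved when requests are built.
from ollama import Client

client = Client(host="http://localhost:11435/proxy-context/auto")
response = client.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```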