
Ollama Context Proxy

A smart proxy server for Ollama that provides automatic context size detection and URL-based context routing. This proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.

Why Ollama Context Proxy?

The Problem

  • Memory Efficiency: Large context windows consume significantly more GPU memory and processing time
  • Manual Configuration: Traditional setups require you to manually set context sizes for each request
  • One-Size-Fits-All: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
  • Performance Impact: Using a 32K context for a simple 100-token prompt is inefficient

The Solution

Ollama Context Proxy solves these issues by:

  1. 🧠 Intelligent Auto-Sizing: Automatically analyzes prompt content and selects the optimal context size
  2. 🎯 Resource Optimization: Uses smaller contexts for small prompts, larger contexts only when needed
  3. ⚡ Performance Boost: Reduces memory usage and inference time for most requests
  4. 🔧 Flexible Routing: URL-based routing allows explicit context control when needed
  5. 🔄 Drop-in Replacement: Works as a transparent proxy - no client code changes required

Features

  • Automatic Context Detection: Analyzes prompts and automatically selects appropriate context sizes
  • URL-Based Routing: Explicit context control via URL paths (/proxy-context/4096/api/generate)
  • Multiple API Support: Works with Ollama native API and OpenAI-compatible endpoints
  • Streaming Support: Full support for streaming responses
  • Resource Optimization: Reduces memory usage by using appropriate context sizes
  • Docker Ready: Includes Docker configuration for easy deployment
  • Environment Variable Support: Configurable via OLLAMA_BASE_URL

Quick Start

Using Docker

# Build the Docker image
docker build -t ollama-context-proxy .

# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy

# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy

Direct Python Usage

# Install dependencies
pip install -r requirements.txt

# Run with auto-detection of Ollama
python3 ollama-context-proxy.py

# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434

Configuration

Environment Variables

Variable          Default              Description
OLLAMA_BASE_URL   http://ollama:11434  Full URL to Ollama server (Docker default)

Command Line Arguments

python3 ollama-context-proxy.py [OPTIONS]

Options:
  --ollama-host HOST     Ollama server host (default: localhost or from OLLAMA_BASE_URL)
  --ollama-port PORT     Ollama server port (default: 11434)
  --proxy-port PORT      Proxy server port (default: 11435)
  --log-level LEVEL      Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)

Usage Examples

Automatic Context Sizing

The proxy automatically determines the best context size based on your prompt:

# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Write a short story about a robot.",
    "stream": false
  }'

# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Fixed Context Sizes

When you need explicit control over context size:

# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Hello world"}'

# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Your very long prompt here..."}'

OpenAI-Compatible Endpoints

# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 150
  }'

Health Check

# Check proxy status and available context sizes
curl http://localhost:11435/health

How Auto-Sizing Works

The proxy uses intelligent analysis to determine optimal context sizes:

  1. Content Analysis: Extracts and analyzes prompt text from various endpoint formats
  2. Token Estimation: Estimates input tokens using character-based approximation
  3. Buffer Calculation: Adds buffers for system prompts, response space, and safety margins
  4. Context Selection: Chooses the smallest available context that can handle the request

Available Context Sizes

  • 2K (2048 tokens): Short prompts, simple Q&A
  • 4K (4096 tokens): Medium prompts, code snippets
  • 8K (8192 tokens): Long prompts, detailed instructions
  • 16K (16384 tokens): Very long prompts, document analysis
  • 32K (32768 tokens): Maximum context, large documents

Auto-Sizing Logic

Total Required = Input Tokens + Max Response Tokens + System Overhead + Safety Margin
                      ↓                    ↓              ↓               ↓
                  Estimated from      From request    100 tokens     200 tokens
                  prompt content      max_tokens      buffer         buffer
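
For illustration, here is a minimal Python sketch of this selection step. It is not the proxy's actual code: the buffer values follow the diagram above, while the ~4-characters-per-token estimate and the 512-token default response budget are assumptions.

# Illustrative sketch only, not the actual implementation
CONTEXT_SIZES = [2048, 4096, 8192, 16384, 32768]
SYSTEM_OVERHEAD = 100   # buffer for system prompts
SAFETY_MARGIN = 200     # safety buffer

def estimate_tokens(text):
    # Character-based approximation: roughly 4 characters per token (assumed ratio)
    return max(1, len(text) // 4)

def select_context(prompt, max_response_tokens=512):
    required = estimate_tokens(prompt) + max_response_tokens + SYSTEM_OVERHEAD + SAFETY_MARGIN
    # Choose the smallest available context that can handle the request
    for size in CONTEXT_SIZES:
        if required <= size:
            return size
    return CONTEXT_SIZES[-1]  # request may exceed the largest available context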

Docker Compose Integration

Example docker-compose.yml integration:

version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:

API Endpoints

Proxy Endpoints

Endpoint Pattern               Description
/proxy-context/auto/{path}     Auto-detect context size
/proxy-context/{size}/{path}   Fixed context size (2048, 4096, 8192, 16384, 32768)
/health                        Health check and proxy status

Supported Ollama Endpoints

All standard Ollama endpoints are supported through the proxy, including:

  • /api/generate - Text generation
  • /api/chat - Chat completions
  • /api/tags - List models
  • /api/show - Model information
  • /v1/chat/completions - OpenAI-compatible chat
  • /v1/completions - OpenAI-compatible completions
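
The context prefix simply composes with any of these paths. The small hypothetical helper below (not part of the proxy) shows how proxy URLs are formed:

def proxy_url(path, context="auto", base="http://localhost:11435"):
    # Prefix a standard Ollama path with the proxy's context segment
    return f"{base}/proxy-context/{context}{path}"

print(proxy_url("/api/generate"))               # .../proxy-context/auto/api/generate
print(proxy_url("/v1/chat/completions", 8192))  # .../proxy-context/8192/v1/chat/completions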

Performance Benefits

Memory Usage Reduction

Using appropriate context sizes can significantly reduce GPU memory usage:

  • 2K context: ~1-2GB GPU memory
  • 4K context: ~2-4GB GPU memory
  • 8K context: ~4-8GB GPU memory
  • 16K context: ~8-16GB GPU memory
  • 32K context: ~16-32GB GPU memory

Response Time Improvement

Smaller contexts process faster:

  • Simple prompts: 2-3x faster with auto-sizing vs. fixed 32K
  • Medium prompts: 1.5-2x faster with optimal sizing
  • Large prompts: Minimal difference (uses large context anyway)

Monitoring and Logging

The proxy provides detailed logging for monitoring:

# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG

Log information includes:

  • Context size selection reasoning
  • Token estimation details
  • Request routing information
  • Performance metrics

Troubleshooting

Common Issues

Connection Refused

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Verify proxy configuration
curl http://localhost:11435/health

Context Size Warnings

Request may exceed largest available context!
  • The request requires more than 32K tokens
  • Consider breaking large prompts into smaller chunks
  • Use streaming for very long responses

Auto-sizing Not Working

  • Ensure you're using /proxy-context/auto/ in your URLs
  • Check request format matches supported endpoints
  • Enable DEBUG logging to see analysis details

Debug Mode

# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG

# This will show:
# - Token estimation details
# - Context selection reasoning  
# - Request/response routing info

Development

Requirements

# aiohttp is the only third-party dependency (asyncio ships with the Python standard library)
pip install aiohttp

Project Structure

ollama-context-proxy/
├── ollama-context-proxy.py    # Main proxy server
├── requirements.txt           # Python dependencies
├── Dockerfile                # Docker configuration
└── README.md                 # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

[Add your license information here]

Support

  • Issues: Report bugs and feature requests via GitHub issues
  • Documentation: This README and inline code comments
  • Community: [Add community links if applicable]

Note: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from http://localhost:11434 to http://localhost:11435/proxy-context/auto to enable intelligent context sizing.
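
For example, with the OpenAI Python client (assuming you use the openai package; the api_key value is a placeholder, since Ollama's OpenAI-compatible API accepts any value), only the base URL points at the proxy:

from openai import OpenAI

# Point the client at the proxy's auto-sizing route instead of Ollama directly
client = OpenAI(base_url="http://localhost:11435/proxy-context/auto/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)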