Ollama Context Proxy
A smart proxy server for Ollama that provides automatic context size detection and URL-based context routing. This proxy intelligently analyzes incoming requests to determine the optimal context window size, eliminating the need to manually configure context sizes for different types of prompts.
Why Ollama Context Proxy?
The Problem
- Memory Efficiency: Large context windows consume significantly more GPU memory and processing time
- Manual Configuration: Traditional setups require you to manually set context sizes for each request
- One-Size-Fits-All: Most deployments use a fixed context size, wasting resources on small prompts or limiting large ones
- Performance Impact: Using a 32K context for a simple 100-token prompt is inefficient
The Solution
Ollama Context Proxy solves these issues by:
- 🧠 Intelligent Auto-Sizing: Automatically analyzes prompt content and selects the optimal context size
- 🎯 Resource Optimization: Uses smaller contexts for small prompts, larger contexts only when needed
- ⚡ Performance Boost: Reduces memory usage and inference time for most requests
- 🔧 Flexible Routing: URL-based routing allows explicit context control when needed
- 🔄 Drop-in Replacement: Works as a transparent proxy - no client code changes required
Features
- Automatic Context Detection: Analyzes prompts and automatically selects appropriate context sizes
- URL-Based Routing: Explicit context control via URL paths (/proxy-context/4096/api/generate)
- Multiple API Support: Works with the Ollama native API and OpenAI-compatible endpoints
- Streaming Support: Full support for streaming responses
- Resource Optimization: Reduces memory usage by using appropriate context sizes
- Docker Ready: Includes Docker configuration for easy deployment
- Environment Variable Support: Configurable via OLLAMA_BASE_URL
Quick Start
Using Docker (Recommended)
# Build the Docker image
docker build -t ollama-context-proxy .
# Run with default settings (connects to ollama:11434)
docker run -p 11435:11435 ollama-context-proxy
# Run with custom Ollama URL
docker run -p 11435:11435 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 ollama-context-proxy
Direct Python Usage
# Install dependencies
pip install -r requirements.txt
# Run with auto-detection of Ollama
python3 ollama-context-proxy.py
# Run with custom Ollama host
python3 ollama-context-proxy.py --ollama-host your-ollama-host --ollama-port 11434
Configuration
Environment Variables
Variable | Default | Description
---|---|---
OLLAMA_BASE_URL | http://ollama:11434 | Full URL to the Ollama server (Docker default)
Command Line Arguments
python3 ollama-context-proxy.py [OPTIONS]
Options:
--ollama-host HOST Ollama server host (default: localhost or from OLLAMA_BASE_URL)
--ollama-port PORT Ollama server port (default: 11434)
--proxy-port PORT Proxy server port (default: 11435)
--log-level LEVEL Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
Usage Examples
Automatic Context Sizing (Recommended)
The proxy automatically determines the best context size based on your prompt:
# Auto-sizing - proxy analyzes prompt and chooses optimal context
curl -X POST http://localhost:11435/proxy-context/auto/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"prompt": "Write a short story about a robot.",
"stream": false
}'
# Chat endpoint with auto-sizing
curl -X POST http://localhost:11435/proxy-context/auto/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Fixed Context Sizes
When you need explicit control over context size:
# Force 2K context for small prompts
curl -X POST http://localhost:11435/proxy-context/2048/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2", "prompt": "Hello world"}'
# Force 16K context for large prompts
curl -X POST http://localhost:11435/proxy-context/16384/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2", "prompt": "Your very long prompt here..."}'
OpenAI-Compatible Endpoints
# Auto-sizing with OpenAI-compatible API
curl -X POST http://localhost:11435/proxy-context/auto/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 150
}'
Health Check
# Check proxy status and available context sizes
curl http://localhost:11435/health
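The same calls can be made from Python. Below is a minimal sketch using the requests library (assumed installed), mirroring the curl examples above and assuming the proxy forwards Ollama's JSON responses unchanged:
import requests

PROXY = "http://localhost:11435/proxy-context/auto"

# Auto-sized generation request, mirroring the curl example above
resp = requests.post(
    f"{PROXY}/api/generate",
    json={
        "model": "llama2",
        "prompt": "Write a short story about a robot.",
        "stream": False,
    },
)
print(resp.json()["response"])

# Check proxy status and available context sizes
print(requests.get("http://localhost:11435/health").json())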
How Auto-Sizing Works
The proxy uses intelligent analysis to determine optimal context sizes:
- Content Analysis: Extracts and analyzes prompt text from various endpoint formats
- Token Estimation: Estimates input tokens using character-based approximation
- Buffer Calculation: Adds buffers for system prompts, response space, and safety margins
- Context Selection: Chooses the smallest available context that can handle the request
Available Context Sizes
- 2K (2048 tokens): Short prompts, simple Q&A
- 4K (4096 tokens): Medium prompts, code snippets
- 8K (8192 tokens): Long prompts, detailed instructions
- 16K (16384 tokens): Very long prompts, document analysis
- 32K (32768 tokens): Maximum context, large documents
Auto-Sizing Logic
Total Required = Input Tokens + Max Response Tokens + System Overhead + Safety Margin
- Input Tokens: estimated from the prompt content
- Max Response Tokens: taken from the request's max_tokens
- System Overhead: 100-token buffer
- Safety Margin: 200-token buffer
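In code, the selection rule above amounts to something like the following sketch (illustrative only; the names, the 4-characters-per-token estimate, and the default response budget are assumptions, not the proxy's actual implementation):
# Illustrative sketch of the auto-sizing rule described above
CONTEXT_SIZES = [2048, 4096, 8192, 16384, 32768]
SYSTEM_OVERHEAD = 100   # buffer for system prompts
SAFETY_MARGIN = 200     # extra headroom

def estimate_tokens(text: str) -> int:
    # Rough character-based approximation (~4 characters per token)
    return max(1, len(text) // 4)

def select_context(prompt: str, max_response_tokens: int = 512) -> int:
    required = estimate_tokens(prompt) + max_response_tokens + SYSTEM_OVERHEAD + SAFETY_MARGIN
    # Choose the smallest available context that can handle the request
    for size in CONTEXT_SIZES:
        if required <= size:
            return size
    # Request may exceed the largest available context
    return CONTEXT_SIZES[-1]

print(select_context("Write a short story about a robot."))  # -> 2048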
Docker Compose Integration
Example docker-compose.yml integration:
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
  ollama-context-proxy:
    build: ./ollama-context-proxy
    ports:
      - "11435:11435"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama_data:
API Endpoints
Proxy Endpoints
Endpoint Pattern | Description
---|---
/proxy-context/auto/{path} | Auto-detect context size
/proxy-context/{size}/{path} | Fixed context size (2048, 4096, 8192, 16384, 32768)
/health | Health check and proxy status
Supported Ollama Endpoints
All standard Ollama endpoints are supported through the proxy:
- /api/generate - Text generation
- /api/chat - Chat completions
- /api/tags - List models
- /api/show - Model information
- /v1/chat/completions - OpenAI-compatible chat
- /v1/completions - OpenAI-compatible completions
Performance Benefits
Memory Usage Reduction
Using appropriate context sizes can significantly reduce GPU memory usage; the figures below are rough estimates and vary by model:
- 2K context: ~1-2GB GPU memory
- 4K context: ~2-4GB GPU memory
- 8K context: ~4-8GB GPU memory
- 16K context: ~8-16GB GPU memory
- 32K context: ~16-32GB GPU memory
Response Time Improvement
Smaller contexts process faster:
- Simple prompts: 2-3x faster with auto-sizing vs. fixed 32K
- Medium prompts: 1.5-2x faster with optimal sizing
- Large prompts: Minimal difference (uses large context anyway)
Monitoring and Logging
The proxy provides detailed logging for monitoring:
# Enable debug logging for detailed analysis
python3 ollama-context-proxy.py --log-level DEBUG
Log information includes:
- Context size selection reasoning
- Token estimation details
- Request routing information
- Performance metrics
Troubleshooting
Common Issues
Connection Refused
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Verify proxy configuration
curl http://localhost:11435/health
Context Size Warnings
If you see the warning "Request may exceed largest available context!":
- The request requires more than 32K tokens
- Consider breaking large prompts into smaller chunks (see the sketch below)
- Use streaming for very long responses
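As a rough illustration of the chunking suggestion above (the chunk size, endpoint usage, and summarization prompt are assumptions, not a prescribed workflow):
import requests

PROXY = "http://localhost:11435/proxy-context/auto"

def chunk_text(text: str, max_chars: int = 8000):
    # Roughly 8000 characters ≈ 2000 tokens; adjust to fit your context budget
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_chunks(document: str, model: str = "llama2"):
    summaries = []
    for chunk in chunk_text(document):
        resp = requests.post(
            f"{PROXY}/api/generate",
            json={"model": model, "prompt": f"Summarize:\n{chunk}", "stream": False},
        )
        summaries.append(resp.json().get("response", ""))
    return summaries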
Auto-sizing Not Working
- Ensure you're using /proxy-context/auto/ in your URLs
- Check that the request format matches supported endpoints
- Enable DEBUG logging to see analysis details
Debug Mode
# Run with debug logging
python3 ollama-context-proxy.py --log-level DEBUG
# This will show:
# - Token estimation details
# - Context selection reasoning
# - Request/response routing info
Development
Requirements
pip install aiohttp
(asyncio is part of the Python standard library and does not need to be installed separately.)
Project Structure
ollama-context-proxy/
├── ollama-context-proxy.py # Main proxy server
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
└── README.md # This file
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
[Add your license information here]
Support
- Issues: Report bugs and feature requests via GitHub issues
- Documentation: This README and inline code comments
- Community: [Add community links if applicable]
Note: This proxy is designed to work transparently with existing Ollama clients. Simply change your Ollama URL from http://localhost:11434 to http://localhost:11435/proxy-context/auto to enable intelligent context sizing.
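For example, an OpenAI-compatible client only needs its base URL changed (a sketch assuming the openai Python package; the placeholder API key reflects that Ollama's OpenAI-compatible endpoints do not validate it):
from openai import OpenAI

# Point an existing OpenAI-style client at the proxy instead of Ollama directly
client = OpenAI(
    base_url="http://localhost:11435/proxy-context/auto/v1",  # was http://localhost:11434/v1
    api_key="ollama",  # placeholder; not checked by Ollama
)

reply = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)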