# Multi-Peer Whisper ASR Architecture
## Overview
The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.
## Architecture Changes
### Before (Single AudioProcessor)
```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```
**Problems:**
- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers
### After (Per-Peer AudioProcessor)
```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```
**Benefits:**
- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers
## Key Components
### 1. Per-Peer Audio Processors
- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]`
- **Automatic Creation**: A new AudioProcessor is created when a peer connects (see the sketch below)
- **Peer Identification**: Each processor tagged with peer name
- **Independent Processing**: Separate audio buffers, queues, and transcription threads
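A minimal sketch of this pattern, assuming `AudioProcessor` is defined in the same module and using a hypothetical `get_or_create_processor()` helper (the actual code in `bots/whisper.py` may be organised differently):
```python
from typing import Callable, Dict

# Module-level registry: one AudioProcessor per connected peer.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Return the processor for this peer, creating one on the first audio track."""
    if peer_name not in _audio_processors:
        # Each peer gets its own buffers, queue, and transcription thread.
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```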
### 2. Enhanced AudioProcessor Class
```python
from typing import Callable

class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: peer identification
        # ... rest of initialization
```
### 3. Speaker-Tagged Transcriptions
- **Final transcriptions**: `"🎤 Alice: Hello there"`
- **Partial transcriptions**: `"🎤 Alice [partial]: Hello th..."`
- **Clear attribution**: Always know who said what (format sketched below)
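The tag itself is simple to produce; a hypothetical helper (name and exact formatting are illustrative, not the module's actual function) would look like:
```python
def format_transcription(peer_name: str, text: str, is_partial: bool = False) -> str:
    """Prefix a transcription with the speaker tag used in chat messages."""
    suffix = " [partial]" if is_partial else ""
    return f"🎤 {peer_name}{suffix}: {text}"

# format_transcription("Alice", "Hello there")         -> "🎤 Alice: Hello there"
# format_transcription("Alice", "Hello th...", True)   -> "🎤 Alice [partial]: Hello th..."
```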
### 4. Peer Management
- **Connection**: AudioProcessor created on first audio track
- **Disconnection**: Cleanup via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging
## API Changes
### New Functions
```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```
### Modified Functions
```python
# Old
AudioProcessor(send_chat_func)
# New
AudioProcessor(peer_name, send_chat_func)
```
## Usage Examples
### 1. Multiple Speakers Scenario
```
# In a 3-person meeting:
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```
### 2. Debugging Multiple Processors
```bash
# Check status of all active processors
python force_transcription.py stats
# Force transcription for all peers
python force_transcription.py
```
### 3. Monitoring Active Connections
```python
from bots.whisper import get_active_processors
processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```
## Performance Considerations
### Resource Usage
- **Memory**: Linear scaling with number of speakers
- **CPU**: Parallel processing threads (one per speaker)
- **Model**: A single Whisper model is shared across all processors, so model memory does not multiply per speaker (see the sketch below)
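A lazily loaded, module-level model handle is one way to achieve this sharing; the sketch below assumes the openai-whisper package and a hypothetical `_get_model()` helper, not necessarily the loader used in `bots/whisper.py`:
```python
import threading

import whisper  # openai-whisper; the project may use a different backend

_model = None
_model_lock = threading.Lock()

def _get_model():
    """Load the Whisper model once and share it across all AudioProcessors."""
    global _model
    with _model_lock:
        if _model is None:
            _model = whisper.load_model("base")
        return _model
```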
### Scalability
- **Small groups (2-5 people)**: Excellent performance
- **Medium groups (6-15 people)**: Good performance
- **Large groups (15+ people)**: May need optimization
### Optimization Strategies
1. **Silence Detection**: Skip processing for quiet/inactive speakers
2. **Dynamic Cleanup**: Remove processors for disconnected peers
3. **Configurable Thresholds**: Adjust per-speaker sensitivity
4. **Resource Limits**: Cap the number of concurrent processors if needed (see the sketch after this list)
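As an illustration, silence gating and a processor cap could look roughly like this; the threshold value, cap, and helper names are assumptions rather than current behaviour:
```python
import numpy as np

SILENCE_RMS_THRESHOLD = 0.01    # assumed per-speaker sensitivity threshold
MAX_CONCURRENT_PROCESSORS = 16  # assumed upper bound on simultaneous speakers

def should_process(frame: np.ndarray) -> bool:
    """Skip transcription work for frames that are effectively silent."""
    rms = float(np.sqrt(np.mean(np.square(frame, dtype=np.float64))))
    return rms >= SILENCE_RMS_THRESHOLD

def can_add_processor(active_processors: dict) -> bool:
    """Refuse to create new processors once the configured limit is reached."""
    return len(active_processors) < MAX_CONCURRENT_PROCESSORS
```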
## Debugging Tools
### 1. Force Transcription (Enhanced)
```bash
# Shows status for all active peers
python force_transcription.py
# Output example:
🔍 Found 3 active audio processors:

👤 Alice:
   - Running: True
   - Buffer size: 5 frames
   - Queue size: 1
   - Current phrase length: 8000 samples

👤 Bob:
   - Running: True
   - Buffer size: 0 frames
   - Queue size: 0
   - Current phrase length: 0 samples
```
### 2. Audio Statistics (Per-Peer)
```bash
python force_transcription.py stats
# Shows detailed metrics for each peer
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
   Sample rate: 16000Hz
   Current buffer size: 3
   Processing queue size: 0
   Current phrase:
      Duration: 1.25s
      RMS: 0.0234
      Peak: 0.1892
### 3. Enhanced Logging
```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```
## Migration Guide
### For Existing Code
- **No changes needed** for basic usage
- **Enhanced debugging** with per-peer information
- **Better transcription quality** automatically
### For Advanced Usage
- Use `get_active_processors()` to monitor speakers
- Call `cleanup_peer_processor()` on peer disconnect (example below)
- Check peer-specific statistics in force_transcription.py
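For example, a peer-disconnect handler might tie these together (the handler name and signalling hook are hypothetical and depend on your WebRTC integration):
```python
from bots.whisper import cleanup_peer_processor, get_active_processors

def on_peer_disconnected(peer_name: str) -> None:
    """Release the per-peer AudioProcessor when a speaker leaves the session."""
    if peer_name in get_active_processors():
        cleanup_peer_processor(peer_name)
```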
## Error Handling
### Common Issues
1. **No AudioProcessor for peer**: Automatically created on first audio
2. **Peer disconnection**: Manual cleanup recommended
3. **Resource exhaustion**: Monitor with `get_active_processors()`
### Error Messages
```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```
## Future Enhancements
### Planned Features
1. **Voice Activity Detection**: Only process when speaker is active
2. **Speaker Diarization**: Merge multiple audio sources per speaker
3. **Language Detection**: Per-speaker language settings
4. **Quality Metrics**: Per-speaker transcription confidence scores
### Possible Optimizations
1. **Shared Processing**: Batch multiple speakers in single inference
2. **Dynamic Model Loading**: Different models per speaker/language
3. **Audio Mixing**: Optional mixed transcription for meeting notes
4. **Real-time Adaptation**: Adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.