# Multi-Peer Whisper ASR Architecture

## Overview

The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.

## Architecture Changes

### Before (Single AudioProcessor)

```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```

**Problems:**
- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers

### After (Per-Peer AudioProcessor)

```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```

**Benefits:**
- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers

## Key Components

### 1. Per-Peer Audio Processors

- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]`
- **Automatic Creation**: New AudioProcessor created when peer connects
- **Peer Identification**: Each processor tagged with peer name
- **Independent Processing**: Separate audio buffers, queues, and transcription threads

### 2. Enhanced AudioProcessor Class

```python
class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: Peer identification
        # ... rest of initialization
```

### 3. Speaker-Tagged Transcriptions

- **Final transcriptions**: `"🎤 Alice: Hello there"`
- **Partial transcriptions**: `"🎤 Alice [partial]: Hello th..."`
- **Clear attribution**: Always know who said what

### 4. Peer Management

- **Connection**: AudioProcessor created on first audio track
- **Disconnection**: Cleanup via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging

## API Changes

### New Functions

```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```

### Modified Functions

```python
# Old
AudioProcessor(send_chat_func)

# New
AudioProcessor(peer_name, send_chat_func)
```
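To make the lifecycle concrete, here is a minimal sketch of how these pieces can fit together. It is not the actual implementation: `get_or_create_processor`, the `_emit` helper, and the `stop()` shutdown hook are hypothetical names, and the `AudioProcessor` body is reduced to a stub; only `_audio_processors`, `cleanup_peer_processor`, `get_active_processors`, and the `AudioProcessor(peer_name, send_chat_func)` constructor come from the sections above.

```python
from typing import Callable, Dict


class AudioProcessor:
    """Stub of the real class; the full version owns buffers, a queue, and a transcription thread."""

    def __init__(self, peer_name: str, send_chat_func: Callable[[str], None]):
        self.peer_name = peer_name            # NEW: peer identification
        self.send_chat_func = send_chat_func
        self.running = True

    def _emit(self, text: str, partial: bool = False) -> None:
        # Speaker-tagged output in the formats shown under "Speaker-Tagged Transcriptions".
        tag = f"🎤 {self.peer_name} [partial]: " if partial else f"🎤 {self.peer_name}: "
        self.send_chat_func(tag + text)

    def stop(self) -> None:
        # Hypothetical shutdown hook: stop the transcription thread and drop buffers.
        self.running = False


# Module-level registry of per-peer processors (the "Global Dictionary" above).
_audio_processors: Dict[str, AudioProcessor] = {}


def get_or_create_processor(peer_name: str, send_chat_func: Callable[[str], None]) -> AudioProcessor:
    """Automatic creation: called when a peer's first audio track arrives."""
    if peer_name not in _audio_processors:
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]


def cleanup_peer_processor(peer_name: str) -> None:
    """Clean up audio processor for disconnected peer."""
    processor = _audio_processors.pop(peer_name, None)
    if processor is not None:
        processor.stop()


def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
    return dict(_audio_processors)  # shallow copy so callers cannot mutate the registry
```

With this shape, the WebRTC track handler only has to call `get_or_create_processor()` when audio arrives and `cleanup_peer_processor()` when the peer leaves; everything in between stays isolated per speaker.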
## Usage Examples

### 1. Multiple Speakers Scenario

```
# In a 3-person meeting:
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```

### 2. Debugging Multiple Processors

```bash
# Check status of all active processors
python force_transcription.py stats

# Force transcription for all peers
python force_transcription.py
```

### 3. Monitoring Active Connections

```python
from bots.whisper import get_active_processors

processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```

## Performance Considerations

### Resource Usage

- **Memory**: Linear scaling with number of speakers
- **CPU**: Parallel processing threads (one per speaker)
- **Model**: Shared Whisper model across all processors (efficient)

### Scalability

- **Small groups (2-5 people)**: Excellent performance
- **Medium groups (6-15 people)**: Good performance
- **Large groups (15+ people)**: May need optimization

### Optimization Strategies

1. **Silence Detection**: Skip processing for quiet/inactive speakers
2. **Dynamic Cleanup**: Remove processors for disconnected peers
3. **Configurable Thresholds**: Adjust per-speaker sensitivity
4. **Resource Limits**: Cap concurrent processors if needed

## Debugging Tools

### 1. Force Transcription (Enhanced)

```bash
# Shows status for all active peers
python force_transcription.py

# Output example:
🔍 Found 3 active audio processors:

👤 Alice:
  - Running: True
  - Buffer size: 5 frames
  - Queue size: 1
  - Current phrase length: 8000 samples

👤 Bob:
  - Running: True
  - Buffer size: 0 frames
  - Queue size: 0
  - Current phrase length: 0 samples
```

### 2. Audio Statistics (Per-Peer)

```bash
python force_transcription.py stats

# Shows detailed metrics for each peer
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
  Sample rate: 16000Hz
  Current buffer size: 3
  Processing queue size: 0
  Current phrase:
    Duration: 1.25s
    RMS: 0.0234
    Peak: 0.1892
```

### 3. Enhanced Logging

```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```

## Migration Guide

### For Existing Code

- **No changes needed** for basic usage
- **Enhanced debugging** with per-peer information
- **Better transcription quality** automatically

### For Advanced Usage

- Use `get_active_processors()` to monitor speakers
- Call `cleanup_peer_processor()` on peer disconnect
- Check peer-specific statistics in `force_transcription.py`

## Error Handling

### Common Issues

1. **No AudioProcessor for peer**: Automatically created on first audio
2. **Peer disconnection**: Manual cleanup recommended
3. **Resource exhaustion**: Monitor with `get_active_processors()`

### Error Messages

```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```

## Future Enhancements

### Planned Features

1. **Voice Activity Detection**: Only process when speaker is active
2. **Speaker Diarization**: Merge multiple audio sources per speaker
3. **Language Detection**: Per-speaker language settings
4. **Quality Metrics**: Per-speaker transcription confidence scores

### Possible Optimizations

1. **Shared Processing**: Batch multiple speakers in single inference
2. **Dynamic Model Loading**: Different models per speaker/language
3. **Audio Mixing**: Optional mixed transcription for meeting notes
4. **Real-time Adaptation**: Adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.