# Multi-Peer Whisper ASR Architecture
## Overview
The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.
## Architecture Changes
### Before (Single AudioProcessor)
```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```
**Problems:**
- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers
### After (Per-Peer AudioProcessor)
```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```
**Benefits:**
- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers
## Key Components
### 1. Per-Peer Audio Processors
- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]`
- **Automatic Creation**: A new AudioProcessor is created when a peer connects (see the sketch below)
- **Peer Identification**: Each processor tagged with peer name
- **Independent Processing**: Separate audio buffers, queues, and transcription threads
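A minimal sketch of this pattern, assuming `AudioProcessor` is defined in the same module and using a hypothetical `get_or_create_processor()` helper (the actual code in `bots/whisper.py` may be organised differently):
```python
from typing import Callable, Dict

# Module-level registry: one AudioProcessor per connected peer.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Return the processor for this peer, creating one on the first audio track."""
    if peer_name not in _audio_processors:
        # Each peer gets its own buffers, queue, and transcription thread.
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```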
### 2. Enhanced AudioProcessor Class
```python
from typing import Callable

class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: peer identification
        # ... rest of initialization
```
### 3. Speaker-Tagged Transcriptions
- **Final transcriptions**: `"🎤 Alice: Hello there"`
- **Partial transcriptions**: `"🎤 Alice [partial]: Hello th..."`
- **Clear attribution**: Always know who said what (format sketched below)
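The tag itself is simple to produce; a hypothetical helper (name and exact formatting are illustrative, not the module's actual function) would look like:
```python
def format_transcription(peer_name: str, text: str, is_partial: bool = False) -> str:
    """Prefix a transcription with the speaker tag used in chat messages."""
    suffix = " [partial]" if is_partial else ""
    return f"🎤 {peer_name}{suffix}: {text}"

# format_transcription("Alice", "Hello there")         -> "🎤 Alice: Hello there"
# format_transcription("Alice", "Hello th...", True)   -> "🎤 Alice [partial]: Hello th..."
```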
### 4. Peer Management
- **Connection**: AudioProcessor created on first audio track
- **Disconnection**: Cleanup via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging
## API Changes
### New Functions
```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```
### Modified Functions
```python
# Old
AudioProcessor(send_chat_func)
# New
AudioProcessor(peer_name, send_chat_func)
```
## Usage Examples
### 1. Multiple Speakers Scenario
```
# In a 3-person meeting:
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```
### 2. Debugging Multiple Processors
```bash
# Check status of all active processors
python force_transcription.py stats
# Force transcription for all peers
python force_transcription.py
```
### 3. Monitoring Active Connections
```python
from bots.whisper import get_active_processors
processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```
## Performance Considerations
### Resource Usage
- **Memory**: Linear scaling with number of speakers
- **CPU**: Parallel processing threads (one per speaker)
- **Model**: A single Whisper model is shared across all processors, so model memory does not multiply per speaker (see the sketch below)
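A lazily loaded, module-level model handle is one way to achieve this sharing; the sketch below assumes the openai-whisper package and a hypothetical `_get_model()` helper, not necessarily the loader used in `bots/whisper.py`:
```python
import threading

import whisper  # openai-whisper; the project may use a different backend

_model = None
_model_lock = threading.Lock()

def _get_model():
    """Load the Whisper model once and share it across all AudioProcessors."""
    global _model
    with _model_lock:
        if _model is None:
            _model = whisper.load_model("base")
        return _model
```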
### Scalability
- **Small groups (2-5 people)**: Excellent performance
- **Medium groups (6-15 people)**: Good performance
- **Large groups (15+ people)**: May need optimization
### Optimization Strategies
1. **Silence Detection**: Skip processing for quiet/inactive speakers
2. **Dynamic Cleanup**: Remove processors for disconnected peers
3. **Configurable Thresholds**: Adjust per-speaker sensitivity
4. **Resource Limits**: Cap the number of concurrent processors if needed (see the sketch after this list)
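As an illustration, silence gating and a processor cap could look roughly like this; the threshold value, cap, and helper names are assumptions rather than current behaviour:
```python
import numpy as np

SILENCE_RMS_THRESHOLD = 0.01    # assumed per-speaker sensitivity threshold
MAX_CONCURRENT_PROCESSORS = 16  # assumed upper bound on simultaneous speakers

def should_process(frame: np.ndarray) -> bool:
    """Skip transcription work for frames that are effectively silent."""
    rms = float(np.sqrt(np.mean(np.square(frame, dtype=np.float64))))
    return rms >= SILENCE_RMS_THRESHOLD

def can_add_processor(active_processors: dict) -> bool:
    """Refuse to create new processors once the configured limit is reached."""
    return len(active_processors) < MAX_CONCURRENT_PROCESSORS
```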
## Debugging Tools
### 1. Force Transcription (Enhanced)
```bash
# Shows status for all active peers
python force_transcription.py
# Output example:
🔍 Found 3 active audio processors:

👤 Alice:
   - Running: True
   - Buffer size: 5 frames
   - Queue size: 1
   - Current phrase length: 8000 samples

👤 Bob:
   - Running: True
   - Buffer size: 0 frames
   - Queue size: 0
   - Current phrase length: 0 samples
```
### 2. Audio Statistics (Per-Peer)
```bash
python force_transcription.py stats
# Shows detailed metrics for each peer
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
   Sample rate: 16000Hz
   Current buffer size: 3
   Processing queue size: 0
   Current phrase:
      Duration: 1.25s
      RMS: 0.0234
      Peak: 0.1892
### 3. Enhanced Logging
```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```
## Migration Guide
### For Existing Code
- **No changes needed** for basic usage
- **Enhanced debugging** with per-peer information
- **Better transcription quality** automatically
### For Advanced Usage
- Use `get_active_processors()` to monitor speakers
- Call `cleanup_peer_processor()` on peer disconnect (example below)
- Check peer-specific statistics in force_transcription.py
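For example, a peer-disconnect handler might tie these together (the handler name and signalling hook are hypothetical and depend on your WebRTC integration):
```python
from bots.whisper import cleanup_peer_processor, get_active_processors

def on_peer_disconnected(peer_name: str) -> None:
    """Release the per-peer AudioProcessor when a speaker leaves the session."""
    if peer_name in get_active_processors():
        cleanup_peer_processor(peer_name)
```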
## Error Handling
### Common Issues
1. **No AudioProcessor for peer**: Automatically created on first audio
2. **Peer disconnection**: Manual cleanup recommended
3. **Resource exhaustion**: Monitor with `get_active_processors()`
### Error Messages
```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```
## Future Enhancements
### Planned Features
1. **Voice Activity Detection**: Only process when speaker is active
2. **Speaker Diarization**: Merge multiple audio sources per speaker
3. **Language Detection**: Per-speaker language settings
4. **Quality Metrics**: Per-speaker transcription confidence scores
### Possible Optimizations
1. **Shared Processing**: Batch multiple speakers in single inference
2. **Dynamic Model Loading**: Different models per speaker/language
3. **Audio Mixing**: Optional mixed transcription for meeting notes
4. **Real-time Adaptation**: Adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.