Multi-Peer Whisper ASR Architecture
Overview
The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.
Architecture Changes
Before (Single AudioProcessor)
```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```
Problems:
- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers
After (Per-Peer AudioProcessor)
```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```
Benefits:
- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers
Key Components
1. Per-Peer Audio Processors
- Global Dictionary: `_audio_processors: Dict[str, AudioProcessor]`, keyed by peer name (sketched below)
- Automatic Creation: New AudioProcessor created when a peer connects
- Peer Identification: Each processor tagged with its peer name
- Independent Processing: Separate audio buffers, queues, and transcription threads per peer
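A minimal sketch of the lookup-or-create pattern behind this registry; `get_or_create_processor` is an illustrative name rather than confirmed module internals:

```python
from typing import Callable, Dict

# Module-level registry: one AudioProcessor per connected peer.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Return the peer's processor, creating it on the first audio track."""
    if peer_name not in _audio_processors:
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```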
2. Enhanced AudioProcessor Class
```python
class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: Peer identification
        # ... rest of initialization
```
3. Speaker-Tagged Transcriptions
- Final transcriptions: `"🎤 Alice: Hello there"` (format sketched below)
- Partial transcriptions: `"🎤 Alice [partial]: Hello th..."`
- Clear attribution: Always know who said what
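The tagging itself is plain string formatting; a sketch of what it might look like (the helper name is hypothetical):

```python
def format_transcription(peer_name: str, text: str, partial: bool = False) -> str:
    """Prefix transcribed text with a speaker tag, e.g. '🎤 Alice: Hello there'."""
    suffix = " [partial]" if partial else ""
    return f"🎤 {peer_name}{suffix}: {text}"
```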
4. Peer Management
- Connection: AudioProcessor created on first audio track
- Disconnection: Cleanup via `cleanup_peer_processor(peer_name)`
- Status Monitoring: `get_active_processors()` for debugging
API Changes
New Functions
```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```
Modified Functions
```python
# Old
AudioProcessor(send_chat_func)

# New
AudioProcessor(peer_name, send_chat_func)
```
Usage Examples
1. Multiple Speakers Scenario
```
# In a 3-person meeting:
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```
2. Debugging Multiple Processors
```bash
# Check status of all active processors
python force_transcription.py stats

# Force transcription for all peers
python force_transcription.py
```
3. Monitoring Active Connections
```python
from bots.whisper import get_active_processors

processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```
Performance Considerations
Resource Usage
- Memory: Linear scaling with number of speakers
- CPU: Parallel processing threads (one per speaker)
- Model: Shared Whisper model across all processors (efficient; loading sketched below)
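The shared model is what keeps memory growth dominated by per-peer buffers rather than model weights: each processor runs its own thread, but all of them transcribe through one Whisper instance. A minimal sketch of lazy, lock-guarded loading, assuming the openai-whisper package:

```python
import threading
import whisper  # assumes the openai-whisper package

_model = None
_model_lock = threading.Lock()

def get_shared_model():
    """Lazily load one Whisper model shared by every AudioProcessor."""
    global _model
    with _model_lock:
        if _model is None:
            _model = whisper.load_model("base")  # model size is illustrative
    return _model
```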
Scalability
- Small groups (2-5 people): Excellent performance
- Medium groups (6-15 people): Good performance
- Large groups (15+ people): May need optimization
Optimization Strategies
- Silence Detection: Skip processing for quiet/inactive speakers (see the sketch after this list)
- Dynamic Cleanup: Remove processors for disconnected peers
- Configurable Thresholds: Adjust per-speaker sensitivity
- Resource Limits: Max concurrent processors if needed
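A sketch of frame-level silence detection using RMS energy; the default threshold is illustrative and would in practice be the configurable per-speaker setting mentioned above:

```python
import numpy as np

def is_silent(samples: np.ndarray, rms_threshold: float = 0.01) -> bool:
    """Return True when a frame's energy is below the speaker's threshold."""
    if samples.size == 0:
        return True
    rms = float(np.sqrt(np.mean(np.square(samples, dtype=np.float64))))
    return rms < rms_threshold
```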
Debugging Tools
1. Force Transcription (Enhanced)
```bash
# Shows status for all active peers
python force_transcription.py

# Output example:
🔍 Found 3 active audio processors:
👤 Alice:
  - Running: True
  - Buffer size: 5 frames
  - Queue size: 1
  - Current phrase length: 8000 samples
👤 Bob:
  - Running: True
  - Buffer size: 0 frames
  - Queue size: 0
  - Current phrase length: 0 samples
```
2. Audio Statistics (Per-Peer)
```bash
python force_transcription.py stats

# Shows detailed metrics for each peer
📊 Detailed Audio Statistics for 2 processors:
👤 Alice:
  Sample rate: 16000Hz
  Current buffer size: 3
  Processing queue size: 0
  Current phrase:
    Duration: 1.25s
    RMS: 0.0234
    Peak: 0.1892
```
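The per-phrase metrics shown here (duration, RMS, peak) fall directly out of the buffered samples; a minimal sketch, assuming float32 audio at 16 kHz:

```python
import numpy as np

def phrase_stats(samples: np.ndarray, sample_rate: int = 16000) -> dict:
    """Compute the duration/RMS/peak metrics shown in the stats output."""
    if samples.size == 0:
        return {"duration_s": 0.0, "rms": 0.0, "peak": 0.0}
    return {
        "duration_s": samples.size / sample_rate,
        "rms": float(np.sqrt(np.mean(np.square(samples, dtype=np.float64)))),
        "peak": float(np.max(np.abs(samples))),
    }
```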
3. Enhanced Logging
```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```
Migration Guide
For Existing Code
- No changes needed for basic usage
- Enhanced debugging with per-peer information
- Better transcription quality automatically
For Advanced Usage
- Use `get_active_processors()` to monitor speakers
- Call `cleanup_peer_processor()` on peer disconnect (combined in the sketch below)
- Check peer-specific statistics in `force_transcription.py`
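Putting both calls together in a disconnect handler; this assumes `cleanup_peer_processor` is importable from `bots.whisper` alongside `get_active_processors`, as the monitoring example above suggests:

```python
from bots.whisper import cleanup_peer_processor, get_active_processors

def on_peer_disconnected(peer_name: str):
    """Release the peer's processor so its buffers and threads are freed."""
    cleanup_peer_processor(peer_name)
    print(f"Remaining speakers: {list(get_active_processors().keys())}")
```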
Error Handling
Common Issues
- No AudioProcessor for peer: Automatically created on first audio
- Peer disconnection: Manual cleanup recommended
- Resource exhaustion: Monitor with `get_active_processors()`
Error Messages
```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```
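The warning above corresponds to a guard on the audio path; a sketch of that defensive check against the registry from earlier (the logger setup and `process_frame` entry point are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def handle_audio_data(peer_name: str, frame):
    """Route an incoming frame to the peer's processor, warning if none exists."""
    processor = _audio_processors.get(peer_name)
    if processor is None:
        logger.warning("No audio processor available to handle audio data for %s", peer_name)
        return
    processor.process_frame(frame)  # assumed per-frame entry point
```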
Future Enhancements
Planned Features
- Voice Activity Detection: Only process when speaker is active
- Speaker Diarization: Merge multiple audio sources per speaker
- Language Detection: Per-speaker language settings
- Quality Metrics: Per-speaker transcription confidence scores
Possible Optimizations
- Shared Processing: Batch multiple speakers in single inference
- Dynamic Model Loading: Different models per speaker/language
- Audio Mixing: Optional mixed transcription for meeting notes
- Real-time Adaptation: Adjust thresholds per speaker automatically
This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.