# Multi-Peer Whisper ASR Architecture

## Overview

The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.

## Architecture Changes

### Before (Single AudioProcessor)

```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```

**Problems:**

- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak at once
- Audio interference between speakers

### After (Per-Peer AudioProcessor)

```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```

**Benefits:**

- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers

## Key Components

### 1. Per-Peer Audio Processors

- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]` maps each peer name to its processor
- **Automatic Creation**: a new `AudioProcessor` is created when a peer's first audio track arrives (see the sketch below)
- **Peer Identification**: each processor is tagged with its peer's name
- **Independent Processing**: separate audio buffers, queues, and transcription threads per peer
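
The get-or-create flow can be summarized in a short sketch. This is illustrative rather than the exact implementation: the helper name `get_or_create_processor` is assumed, and it relies on the `AudioProcessor` class shown in the next subsection.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)

# Global registry mapping peer name → that peer's processor.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Return the processor for peer_name, creating one on first audio."""
    if peer_name not in _audio_processors:
        logger.info("Creating new AudioProcessor for %s", peer_name)
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```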

### 2. Enhanced AudioProcessor Class

```python
from typing import Callable

class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: peer identification
        # ... rest of initialization
```

### 3. Speaker-Tagged Transcriptions

- **Final transcriptions**: `"🎤 Alice: Hello there"`
- **Partial transcriptions**: `"🎤 Alice [partial]: Hello th..."`
- **Clear attribution**: always know who said what (a formatting sketch follows this list)
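
The tag format above is simple enough to capture in a tiny helper. A minimal sketch; the function name `format_transcription` is illustrative, not part of the module's API.

```python
def format_transcription(peer_name: str, text: str, partial: bool = False) -> str:
    """Build the speaker-tagged chat line for a transcription result."""
    tag = f"🎤 {peer_name} [partial]" if partial else f"🎤 {peer_name}"
    return f"{tag}: {text}"

# format_transcription("Alice", "Hello th...", partial=True)
# → "🎤 Alice [partial]: Hello th..."
```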

### 4. Peer Management

- **Connection**: an `AudioProcessor` is created on the peer's first audio track
- **Disconnection**: cleanup via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging

## API Changes

### New Functions

```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```
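
A plausible implementation of both functions against the `_audio_processors` registry sketched earlier. The `stop()` shutdown hook on `AudioProcessor` is an assumption, not a confirmed method.

```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""
    processor = _audio_processors.pop(peer_name, None)
    if processor is not None:
        logger.info("Cleaning up AudioProcessor for disconnected peer: %s", peer_name)
        processor.stop()  # assumed shutdown hook: stop threads, drop buffers

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
    return dict(_audio_processors)  # copy so callers cannot mutate the registry
```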

### Modified Functions

```python
# Old
AudioProcessor(send_chat_func)

# New
AudioProcessor(peer_name, send_chat_func)
```

## Usage Examples

### 1. Multiple Speakers Scenario

```
# In a 3-person meeting:
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```

### 2. Debugging Multiple Processors

```bash
# Check status of all active processors
python force_transcription.py stats

# Force transcription for all peers
python force_transcription.py
```

### 3. Monitoring Active Connections

```python
from bots.whisper import get_active_processors

processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```

## Performance Considerations

### Resource Usage

- **Memory**: scales linearly with the number of speakers
- **CPU**: parallel processing threads (one per speaker)
- **Model**: a single Whisper model is shared across all processors, which keeps memory use flat as speakers join (see the sketch below)
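
One way to share the model is a lazily initialized module-level singleton. A sketch assuming the `openai-whisper` package; the `get_shared_model` helper and the `"base"` model size are illustrative.

```python
import threading

import whisper  # openai-whisper

_model = None
_model_lock = threading.Lock()

def get_shared_model():
    """Lazily load a single Whisper model shared by all AudioProcessors."""
    global _model
    with _model_lock:
        if _model is None:
            _model = whisper.load_model("base")  # model size is illustrative
        return _model
```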

### Scalability

- **Small groups (2-5 people)**: excellent performance
- **Medium groups (6-15 people)**: good performance
- **Large groups (15+ people)**: may need optimization

### Optimization Strategies

1. **Silence Detection**: skip processing for quiet/inactive speakers (see the sketch after this list)
2. **Dynamic Cleanup**: remove processors for disconnected peers
3. **Configurable Thresholds**: adjust per-speaker sensitivity
4. **Resource Limits**: cap the number of concurrent processors if needed
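
Silence detection can be as simple as an RMS gate on each frame. A sketch; the `is_silent` helper and the threshold value are illustrative and would need tuning against real audio.

```python
import numpy as np

SILENCE_RMS_THRESHOLD = 0.01  # illustrative default; tune per deployment

def is_silent(frame: np.ndarray, threshold: float = SILENCE_RMS_THRESHOLD) -> bool:
    """Return True when a float32 PCM frame is quiet enough to skip."""
    rms = float(np.sqrt(np.mean(np.square(frame))))
    return rms < threshold
```

A processor could call this before buffering and simply drop silent frames, so inactive speakers cost almost nothing.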

## Debugging Tools

### 1. Force Transcription (Enhanced)

```bash
# Shows status for all active peers
python force_transcription.py

# Output example:
🔍 Found 3 active audio processors:

👤 Alice:
   - Running: True
   - Buffer size: 5 frames
   - Queue size: 1
   - Current phrase length: 8000 samples

👤 Bob:
   - Running: True
   - Buffer size: 0 frames
   - Queue size: 0
   - Current phrase length: 0 samples
```

### 2. Audio Statistics (Per-Peer)

```bash
python force_transcription.py stats

# Shows detailed metrics for each peer
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
   Sample rate: 16000Hz
   Current buffer size: 3
   Processing queue size: 0
   Current phrase:
     Duration: 1.25s
     RMS: 0.0234
     Peak: 0.1892
```

### 3. Enhanced Logging

```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```

## Migration Guide

### For Existing Code

- **No changes needed** for basic usage
- **Enhanced debugging** with per-peer information
- **Better transcription quality**, automatically

### For Advanced Usage

- Use `get_active_processors()` to monitor speakers
- Call `cleanup_peer_processor()` on peer disconnect (see the sketch below)
- Check peer-specific statistics via `force_transcription.py`
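
A sketch of wiring the cleanup call into a connection-state handler, assuming an aiortc-style `RTCPeerConnection`; the `register_cleanup` helper is illustrative.

```python
from aiortc import RTCPeerConnection

def register_cleanup(pc: RTCPeerConnection, peer_name: str):
    """Tear down the peer's processor when its connection ends."""
    @pc.on("connectionstatechange")
    async def on_connectionstatechange():
        if pc.connectionState in ("failed", "closed", "disconnected"):
            cleanup_peer_processor(peer_name)  # free buffers and threads
```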

## Error Handling

### Common Issues

1. **No AudioProcessor for peer**: one is created automatically on the first audio frame (sketched below)
2. **Peer disconnection**: manual cleanup is recommended
3. **Resource exhaustion**: monitor with `get_active_processors()`
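
A sketch of routing incoming audio defensively, producing the kind of log lines shown in the next subsection; `handle_audio_data` and `process_frame` are assumed names, not confirmed API.

```python
def handle_audio_data(peer_name: str, frame, send_chat_func=None):
    """Route an incoming audio frame to the peer's processor, if possible."""
    processor = _audio_processors.get(peer_name)
    if processor is None:
        if send_chat_func is None:
            logger.error("Cannot create AudioProcessor for %s: no send_chat_func available", peer_name)
            return
        processor = get_or_create_processor(peer_name, send_chat_func)
    processor.process_frame(frame)  # assumed per-frame entry point on AudioProcessor
```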

### Error Messages

```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```

## Future Enhancements

### Planned Features

1. **Voice Activity Detection**: only process when a speaker is active
2. **Speaker Diarization**: merge multiple audio sources per speaker
3. **Language Detection**: per-speaker language settings
4. **Quality Metrics**: per-speaker transcription confidence scores

### Possible Optimizations

1. **Shared Processing**: batch multiple speakers in a single inference call
2. **Dynamic Model Loading**: different models per speaker/language
3. **Audio Mixing**: optional mixed transcription for meeting notes
4. **Real-time Adaptation**: adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.