# Multi-Peer Whisper ASR Architecture

## Overview

The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.

## Architecture Changes

### Before (Single AudioProcessor)

```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```

Problems:

- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers

### After (Per-Peer AudioProcessor)

```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```

Benefits:

- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers

## Key Components

### 1. Per-Peer Audio Processors

- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]` maps each peer name to its own processor
- **Automatic Creation**: a new `AudioProcessor` is created when a peer connects (see the sketch below)
- **Peer Identification**: each processor is tagged with its peer's name
- **Independent Processing**: separate audio buffers, queues, and transcription threads per peer
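
A minimal sketch of how the registry and automatic creation could fit together. The helper name `get_or_create_processor` is an illustrative assumption rather than a function from the module; `AudioProcessor` is the class described in the next section.

```python
from typing import Callable, Dict

# Module-level registry: one AudioProcessor per connected peer.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Hypothetical helper: return this peer's processor, creating it on first use."""
    if peer_name not in _audio_processors:
        # Each speaker gets an isolated processor with its own buffers,
        # queue, and transcription thread.
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```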

### 2. Enhanced AudioProcessor Class

```python
class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: peer identification
        # ... rest of initialization
```
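
The rest of the initialization is not expanded in this document. Purely as an illustration of what "separate audio buffers, queues, and transcription threads" implies, per-peer state might look roughly like the sketch below; every field and method name here is an assumption, not the actual implementation.

```python
import queue
import threading
from typing import Callable

class AudioProcessor:
    """Sketch only: field names below are assumptions, not the real code."""

    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name
        self.send_chat_func = send_chat_func
        self.sample_rate = 16000                    # Whisper expects 16 kHz mono audio
        self.audio_buffer: list = []                # frames of the phrase being assembled
        self.processing_queue: queue.Queue = queue.Queue()  # phrases awaiting transcription
        self.running = True
        # A dedicated worker thread per speaker keeps processing isolated.
        self._worker = threading.Thread(target=self._transcription_loop, daemon=True)
        self._worker.start()

    def _transcription_loop(self) -> None:
        # Placeholder: real code pulls phrases off the queue, runs Whisper,
        # and sends speaker-tagged results via self.send_chat_func.
        while self.running:
            try:
                self.processing_queue.get(timeout=0.5)
            except queue.Empty:
                continue
```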

### 3. Speaker-Tagged Transcriptions

- **Final transcriptions**: `🎤 Alice: Hello there`
- **Partial transcriptions**: `🎤 Alice [partial]: Hello th...`
- **Clear attribution**: it is always clear who said what (a formatting sketch follows this list)
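
Tagging can be as small as prefixing the transcript with the peer's name before the message is sent to chat. The helper below is illustrative; only the emitted format is taken from the examples above.

```python
def format_transcription(peer_name: str, text: str, partial: bool = False) -> str:
    """Build a speaker-tagged chat line, e.g. '🎤 Alice: Hello there'."""
    speaker = f"🎤 {peer_name} [partial]" if partial else f"🎤 {peer_name}"
    return f"{speaker}: {text}"

# format_transcription("Alice", "Hello th...", partial=True)
# -> '🎤 Alice [partial]: Hello th...'
```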

### 4. Peer Management

- **Connection**: an `AudioProcessor` is created when a peer's first audio track arrives
- **Disconnection**: clean up via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging (lifecycle wiring is sketched after this list)
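
How this lifecycle hooks into the WebRTC side is not specified here. One plausible wiring, assuming aiortc-style `track` and `connectionstatechange` events and the hypothetical `get_or_create_processor` helper sketched earlier:

```python
def attach_whisper_handlers(pc, peer_name, send_chat_func):
    """Illustrative wiring for a single RTCPeerConnection (`pc`); not the actual bot code."""

    @pc.on("track")
    def on_track(track):
        if track.kind == "audio":
            # First audio track from this peer: ensure a dedicated processor exists.
            get_or_create_processor(peer_name, send_chat_func)

    @pc.on("connectionstatechange")
    async def on_connection_state_change():
        if pc.connectionState in ("failed", "closed", "disconnected"):
            # Release the peer's buffers, queue, and worker thread.
            cleanup_peer_processor(peer_name)
```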

## API Changes

### New Functions

```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```

### Modified Functions

```python
# Old
AudioProcessor(send_chat_func)

# New
AudioProcessor(peer_name, send_chat_func)
```

## Usage Examples

### 1. Multiple Speakers Scenario

In a three-person meeting, each speaker's transcriptions arrive independently and clearly attributed:

```
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```

### 2. Debugging Multiple Processors

```bash
# Check status of all active processors
python force_transcription.py stats

# Force transcription for all peers
python force_transcription.py
```

### 3. Monitoring Active Connections

```python
from bots.whisper import get_active_processors

processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```

## Performance Considerations

### Resource Usage

- **Memory**: scales linearly with the number of speakers
- **CPU**: parallel processing threads (one per speaker)
- **Model**: a single Whisper model is shared across all processors, which keeps memory usage efficient (see the sketch after this list)
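
One way to get this sharing is to load the model once at module scope and have every per-peer processor call into the same instance. The sketch below assumes the `openai-whisper` package and the `base` model size; the project's actual backend and configuration may differ.

```python
import numpy as np
import whisper  # openai-whisper; assumed backend for this sketch

# Loaded once per process and shared by every per-peer AudioProcessor.
_whisper_model = whisper.load_model("base")

def transcribe_phrase(samples: np.ndarray) -> str:
    """Transcribe one speaker's phrase (float32 mono samples at 16 kHz)."""
    result = _whisper_model.transcribe(samples, fp16=False)
    return result["text"].strip()
```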

### Scalability

- **Small groups (2-5 people)**: excellent performance
- **Medium groups (6-15 people)**: good performance
- **Large groups (15+ people)**: may need optimization

### Optimization Strategies

1. **Silence Detection**: skip processing for quiet/inactive speakers (sketched after this list)
2. **Dynamic Cleanup**: remove processors for disconnected peers
3. **Configurable Thresholds**: adjust per-speaker sensitivity
4. **Resource Limits**: cap the number of concurrent processors if needed
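
For example, silence detection could be a cheap RMS gate applied before a phrase is queued for transcription. The threshold below is an arbitrary example value, not a tuned default.

```python
import numpy as np

SILENCE_RMS_THRESHOLD = 0.01  # example value; tune per deployment

def is_silent(samples: np.ndarray) -> bool:
    """Return True if a phrase is quiet enough to skip transcription."""
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms < SILENCE_RMS_THRESHOLD
```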

## Debugging Tools

### 1. Force Transcription (Enhanced)

```bash
# Shows status for all active peers
python force_transcription.py
```

Example output:

```
🔍 Found 3 active audio processors:

👤 Alice:
  - Running: True
  - Buffer size: 5 frames
  - Queue size: 1
  - Current phrase length: 8000 samples

👤 Bob:
  - Running: True
  - Buffer size: 0 frames
  - Queue size: 0
  - Current phrase length: 0 samples
```

### 2. Audio Statistics (Per-Peer)

```bash
python force_transcription.py stats
```

Example output, with detailed metrics for each peer:

```
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
  Sample rate: 16000Hz
  Current buffer size: 3
  Processing queue size: 0
  Current phrase:
    Duration: 1.25s
    RMS: 0.0234
    Peak: 0.1892
```

### 3. Enhanced Logging

```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```

## Migration Guide

### For Existing Code

- No changes are needed for basic usage
- Debugging is enhanced with per-peer information
- Transcription quality improves automatically

### For Advanced Usage

- Use `get_active_processors()` to monitor active speakers
- Call `cleanup_peer_processor()` when a peer disconnects
- Check peer-specific statistics with `force_transcription.py`

## Error Handling

### Common Issues

1. **No AudioProcessor for a peer**: one is created automatically when the peer's first audio arrives
2. **Peer disconnection**: explicit cleanup via `cleanup_peer_processor()` is recommended
3. **Resource exhaustion**: monitor with `get_active_processors()` (a simple cap is sketched after this list)
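
A simple guard against resource exhaustion could cap how many processors may exist at once; the limit and helper below are illustrative only, not project defaults.

```python
MAX_CONCURRENT_PROCESSORS = 16  # illustrative cap

def can_accept_new_speaker() -> bool:
    """Check the cap before creating another per-peer AudioProcessor."""
    return len(get_active_processors()) < MAX_CONCURRENT_PROCESSORS
```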

### Error Messages

```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```

## Future Enhancements

### Planned Features

1. **Voice Activity Detection**: only process audio while a speaker is active
2. **Speaker Diarization**: merge multiple audio sources per speaker
3. **Language Detection**: per-speaker language settings
4. **Quality Metrics**: per-speaker transcription confidence scores

### Possible Optimizations

1. **Shared Processing**: batch multiple speakers into a single inference call
2. **Dynamic Model Loading**: different models per speaker or language
3. **Audio Mixing**: optional mixed transcription for meeting notes
4. **Real-Time Adaptation**: adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.