# Multi-Peer Whisper ASR Architecture

## Overview

The Whisper ASR system has been redesigned to handle multiple audio tracks from different WebRTC peers simultaneously, with proper speaker identification and isolated audio processing.

## Architecture Changes

### Before (Single AudioProcessor)

```
Peer A Audio → |
Peer B Audio → | → Single AudioProcessor → Mixed Transcription
Peer C Audio → |
```

Problems:

- Mixed audio streams from all speakers
- No speaker identification
- Poor transcription quality when multiple people speak
- Audio interference between speakers

### After (Per-Peer AudioProcessor)

```
Peer A Audio → AudioProcessor A → "🎤 Alice: Hello there"
Peer B Audio → AudioProcessor B → "🎤 Bob: How are you?"
Peer C Audio → AudioProcessor C → "🎤 Charlie: Good morning"
```

Benefits:

- Isolated audio processing per speaker
- Clear speaker identification in transcriptions
- No audio interference between speakers
- Better transcription quality
- Scalable to many speakers

## Key Components

### 1. Per-Peer Audio Processors

- **Global Dictionary**: `_audio_processors: Dict[str, AudioProcessor]` maps each peer name to its own processor
- **Automatic Creation**: a new `AudioProcessor` is created when a peer connects (see the sketch below)
- **Peer Identification**: each processor is tagged with its peer's name
- **Independent Processing**: separate audio buffers, queues, and transcription threads per peer
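
A minimal sketch of how the registry and automatic creation could fit together. The helper name `get_or_create_processor` is an illustrative assumption rather than a function from the module; `AudioProcessor` is the class described in the next section.

```python
from typing import Callable, Dict

# Module-level registry: one AudioProcessor per connected peer.
_audio_processors: Dict[str, "AudioProcessor"] = {}

def get_or_create_processor(peer_name: str, send_chat_func: Callable) -> "AudioProcessor":
    """Hypothetical helper: return this peer's processor, creating it on first use."""
    if peer_name not in _audio_processors:
        # Each speaker gets an isolated processor with its own buffers,
        # queue, and transcription thread.
        _audio_processors[peer_name] = AudioProcessor(peer_name, send_chat_func)
    return _audio_processors[peer_name]
```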

### 2. Enhanced AudioProcessor Class

```python
class AudioProcessor:
    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name  # NEW: peer identification
        # ... rest of initialization
```
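
The rest of the initialization is not expanded in this document. Purely as an illustration of what "separate audio buffers, queues, and transcription threads" implies, per-peer state might look roughly like the sketch below; every field and method name here is an assumption, not the actual implementation.

```python
import queue
import threading
from typing import Callable

class AudioProcessor:
    """Sketch only: field names below are assumptions, not the real code."""

    def __init__(self, peer_name: str, send_chat_func: Callable):
        self.peer_name = peer_name
        self.send_chat_func = send_chat_func
        self.sample_rate = 16000                    # Whisper expects 16 kHz mono audio
        self.audio_buffer: list = []                # frames of the phrase being assembled
        self.processing_queue: queue.Queue = queue.Queue()  # phrases awaiting transcription
        self.running = True
        # A dedicated worker thread per speaker keeps processing isolated.
        self._worker = threading.Thread(target=self._transcription_loop, daemon=True)
        self._worker.start()

    def _transcription_loop(self) -> None:
        # Placeholder: real code pulls phrases off the queue, runs Whisper,
        # and sends speaker-tagged results via self.send_chat_func.
        while self.running:
            try:
                self.processing_queue.get(timeout=0.5)
            except queue.Empty:
                continue
```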

### 3. Speaker-Tagged Transcriptions

- **Final transcriptions**: `🎤 Alice: Hello there`
- **Partial transcriptions**: `🎤 Alice [partial]: Hello th...`
- **Clear attribution**: it is always clear who said what (a formatting sketch follows this list)
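
Tagging can be as small as prefixing the transcript with the peer's name before the message is sent to chat. The helper below is illustrative; only the emitted format is taken from the examples above.

```python
def format_transcription(peer_name: str, text: str, partial: bool = False) -> str:
    """Build a speaker-tagged chat line, e.g. '🎤 Alice: Hello there'."""
    speaker = f"🎤 {peer_name} [partial]" if partial else f"🎤 {peer_name}"
    return f"{speaker}: {text}"

# format_transcription("Alice", "Hello th...", partial=True)
# -> '🎤 Alice [partial]: Hello th...'
```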

### 4. Peer Management

- **Connection**: an `AudioProcessor` is created when a peer's first audio track arrives
- **Disconnection**: clean up via `cleanup_peer_processor(peer_name)`
- **Status Monitoring**: `get_active_processors()` for debugging (lifecycle wiring is sketched after this list)
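
How this lifecycle hooks into the WebRTC side is not specified here. One plausible wiring, assuming aiortc-style `track` and `connectionstatechange` events and the hypothetical `get_or_create_processor` helper sketched earlier:

```python
def attach_whisper_handlers(pc, peer_name, send_chat_func):
    """Illustrative wiring for a single RTCPeerConnection (`pc`); not the actual bot code."""

    @pc.on("track")
    def on_track(track):
        if track.kind == "audio":
            # First audio track from this peer: ensure a dedicated processor exists.
            get_or_create_processor(peer_name, send_chat_func)

    @pc.on("connectionstatechange")
    async def on_connection_state_change():
        if pc.connectionState in ("failed", "closed", "disconnected"):
            # Release the peer's buffers, queue, and worker thread.
            cleanup_peer_processor(peer_name)
```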

## API Changes

### New Functions

```python
def cleanup_peer_processor(peer_name: str):
    """Clean up audio processor for disconnected peer."""

def get_active_processors() -> Dict[str, AudioProcessor]:
    """Get currently active audio processors."""
```

### Modified Functions

```python
# Old
AudioProcessor(send_chat_func)

# New
AudioProcessor(peer_name, send_chat_func)
```

## Usage Examples

### 1. Multiple Speakers Scenario

In a three-person meeting, each speaker's transcriptions arrive independently and clearly attributed:

```
🎤 Alice: I think we should start with the quarterly review
🎤 Bob [partial]: That sounds like a good...
🎤 Bob: That sounds like a good idea to me
🎤 Charlie: I agree, let's begin
```

### 2. Debugging Multiple Processors

```bash
# Check status of all active processors
python force_transcription.py stats

# Force transcription for all peers
python force_transcription.py
```

### 3. Monitoring Active Connections

```python
from bots.whisper import get_active_processors

processors = get_active_processors()
print(f"Active speakers: {list(processors.keys())}")
```

## Performance Considerations

### Resource Usage

- **Memory**: scales linearly with the number of speakers
- **CPU**: parallel processing threads (one per speaker)
- **Model**: a single Whisper model is shared across all processors, which keeps memory usage efficient (see the sketch after this list)
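
One way to get this sharing is to load the model once at module scope and have every per-peer processor call into the same instance. The sketch below assumes the `openai-whisper` package and the `base` model size; the project's actual backend and configuration may differ.

```python
import numpy as np
import whisper  # openai-whisper; assumed backend for this sketch

# Loaded once per process and shared by every per-peer AudioProcessor.
_whisper_model = whisper.load_model("base")

def transcribe_phrase(samples: np.ndarray) -> str:
    """Transcribe one speaker's phrase (float32 mono samples at 16 kHz)."""
    result = _whisper_model.transcribe(samples, fp16=False)
    return result["text"].strip()
```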

### Scalability

- **Small groups (2-5 people)**: excellent performance
- **Medium groups (6-15 people)**: good performance
- **Large groups (15+ people)**: may need optimization

### Optimization Strategies

1. **Silence Detection**: skip processing for quiet/inactive speakers (sketched after this list)
2. **Dynamic Cleanup**: remove processors for disconnected peers
3. **Configurable Thresholds**: adjust per-speaker sensitivity
4. **Resource Limits**: cap the number of concurrent processors if needed
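
For example, silence detection could be a cheap RMS gate applied before a phrase is queued for transcription. The threshold below is an arbitrary example value, not a tuned default.

```python
import numpy as np

SILENCE_RMS_THRESHOLD = 0.01  # example value; tune per deployment

def is_silent(samples: np.ndarray) -> bool:
    """Return True if a phrase is quiet enough to skip transcription."""
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms < SILENCE_RMS_THRESHOLD
```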

## Debugging Tools

### 1. Force Transcription (Enhanced)

```bash
# Shows status for all active peers
python force_transcription.py
```

Example output:

```
🔍 Found 3 active audio processors:

👤 Alice:
  - Running: True
  - Buffer size: 5 frames
  - Queue size: 1
  - Current phrase length: 8000 samples

👤 Bob:
  - Running: True
  - Buffer size: 0 frames
  - Queue size: 0
  - Current phrase length: 0 samples
```

### 2. Audio Statistics (Per-Peer)

```bash
python force_transcription.py stats
```

Example output, with detailed metrics for each peer:

```
📊 Detailed Audio Statistics for 2 processors:

👤 Alice:
  Sample rate: 16000Hz
  Current buffer size: 3
  Processing queue size: 0
  Current phrase:
    Duration: 1.25s
    RMS: 0.0234
    Peak: 0.1892
```

### 3. Enhanced Logging

```
INFO - Creating new AudioProcessor for Alice
INFO - AudioProcessor initialized for Alice - sample_rate: 16000Hz
INFO - ✅ Transcribed (final) for Alice: 'Hello everyone'
INFO - Cleaning up AudioProcessor for disconnected peer: Bob
```

## Migration Guide

### For Existing Code

- No changes are needed for basic usage
- Debugging is enhanced with per-peer information
- Transcription quality improves automatically

### For Advanced Usage

- Use `get_active_processors()` to monitor active speakers
- Call `cleanup_peer_processor()` when a peer disconnects
- Check peer-specific statistics with `force_transcription.py`

## Error Handling

### Common Issues

1. **No AudioProcessor for a peer**: one is created automatically when the peer's first audio arrives
2. **Peer disconnection**: explicit cleanup via `cleanup_peer_processor()` is recommended
3. **Resource exhaustion**: monitor with `get_active_processors()` (a simple cap is sketched after this list)
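
A simple guard against resource exhaustion could cap how many processors may exist at once; the limit and helper below are illustrative only, not project defaults.

```python
MAX_CONCURRENT_PROCESSORS = 16  # illustrative cap

def can_accept_new_speaker() -> bool:
    """Check the cap before creating another per-peer AudioProcessor."""
    return len(get_active_processors()) < MAX_CONCURRENT_PROCESSORS
```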

### Error Messages

```
ERROR - Cannot create AudioProcessor for Alice: no send_chat_func available
WARNING - No audio processor available to handle audio data for Bob
INFO - Cleaning up AudioProcessor for disconnected peer: Charlie
```

## Future Enhancements

### Planned Features

1. **Voice Activity Detection**: only process audio while a speaker is active
2. **Speaker Diarization**: merge multiple audio sources per speaker
3. **Language Detection**: per-speaker language settings
4. **Quality Metrics**: per-speaker transcription confidence scores

### Possible Optimizations

1. **Shared Processing**: batch multiple speakers into a single inference call
2. **Dynamic Model Loading**: different models per speaker or language
3. **Audio Mixing**: optional mixed transcription for meeting notes
4. **Real-Time Adaptation**: adjust thresholds per speaker automatically

This new architecture provides a robust foundation for multi-speaker ASR with clear attribution, better quality, and comprehensive debugging capabilities.