# Whisper ASR Enhanced Logging
This enhancement adds detailed logging to the Whisper ASR system to help debug and monitor speech recognition performance.
## New Logging Features
### 1. Model Loading
- Logs when the Whisper model is being loaded
- Shows which model variant is being used
- Confirms successful processor and model initialization
### 2. Audio Frame Processing
- **Frame-by-frame details**: Sample rate, format, layout, shape, and data type
- **Audio quality metrics**: RMS level and peak amplitude for each frame
- **Format conversions**: Logs when converting stereo to mono, resampling, or normalizing
- **Frame counting**: To reduce log noise, full frame details are logged only every 20th frame
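
The conversions and metrics listed above can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: the function names, the logger name `whisper_sketch`, and the naive decimation (which skips the anti-alias filtering a real resampler would apply) are all assumptions.

```python
import logging
import math

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("whisper_sketch")

LOG_EVERY = 20  # the guide logs a full summary every 20 frames


def to_mono(stereo):
    """Average (left, right) sample pairs into a single mono channel."""
    return [(left + right) / 2 for left, right in stereo]


def normalize_int16(samples):
    """Scale int16-range values to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def decimate(samples, factor):
    """Naive downsampling: keep every `factor`-th sample (no filtering)."""
    return samples[::factor]


def rms_and_peak(samples):
    """Return (RMS level, peak amplitude) for a list of float samples."""
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


def log_frame(frame_number, samples):
    """Log metrics at INFO on every 20th frame, otherwise at DEBUG."""
    rms, peak = rms_and_peak(samples)
    level = logging.INFO if frame_number % LOG_EVERY == 0 else logging.DEBUG
    logger.log(level, "Audio frame #%d: RMS: %.4f, Peak: %.4f",
               frame_number, rms, peak)
    return level
```

A 48000Hz stereo frame of 1440 sample pairs passed through `to_mono`, `normalize_int16`, and `decimate(..., 3)` yields 480 mono samples, matching the 16000Hz frame size shown in the sample log output.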
### 3. Audio Buffer Management
- **Buffer status**: Shows buffer size in frames and milliseconds
- **Queue management**: Tracks when audio is queued for processing
- **Audio metrics**: RMS, peak amplitude, and duration for queued chunks
- **Queue size monitoring**: Shows processing queue depth
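
As a rough illustration of the buffer and queue logging described above, the sketch below accumulates frames and hands a chunk to a bounded queue once a threshold is reached. The class name `AudioBuffer`, the 10-frame threshold, and the queue size are assumptions chosen to match the 4800-sample chunk in the sample output, not values taken from the actual implementation.

```python
import logging
import queue

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("whisper_sketch.buffer")

SAMPLE_RATE = 16000
FRAMES_PER_CHUNK = 10  # assumed threshold: 10 frames x 480 samples = 0.30 s


class AudioBuffer:
    """Collects frames and queues them for processing in fixed-size chunks."""

    def __init__(self, max_queue=8):
        self.frames = []
        self.pending = queue.Queue(maxsize=max_queue)

    def add_frame(self, samples):
        """Buffer one frame; return True if a chunk was queued."""
        self.frames.append(list(samples))
        buffered = sum(len(f) for f in self.frames)
        logger.debug(
            "Added audio chunk: %d samples, buffer size: %d frames (%dms)",
            len(samples), len(self.frames), 1000 * buffered // SAMPLE_RATE)
        if len(self.frames) >= FRAMES_PER_CHUNK:
            return self._flush()
        return False

    def _flush(self):
        """Move buffered frames to the queue, warning if the queue is full."""
        chunk = [s for frame in self.frames for s in frame]
        self.frames = []
        try:
            self.pending.put_nowait(chunk)
        except queue.Full:
            logger.warning("Processing queue full, dropping %d samples",
                           len(chunk))
            return False
        logger.debug("Added to processing queue, queue size: %d",
                     self.pending.qsize())
        return True
```

Using `put_nowait` keeps the audio callback from blocking when the consumer falls behind; the trade-off is that a full queue drops the chunk, which is why that case is logged at WARNING.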
### 4. ASR Processing Pipeline
- **Processing timing**: Separate timing for feature extraction, model inference, and decoding
- **Audio analysis**: Duration, RMS, and peak levels for audio being transcribed
- **Phrase detection**: Logs when phrases are considered complete
- **Streaming vs final**: Clear distinction between partial and final transcriptions
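
One way to obtain separate timings per stage is a small context manager around each step. The sketch below is an assumption about how this could be done, not the project's implementation: `timed`, `transcribe_with_timing`, and the three callables passed in are hypothetical stand-ins for the real feature extractor, model, and decoder.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("whisper_sketch.asr")


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def transcribe_with_timing(audio, extract, infer, decode):
    """Run three pipeline stages and log how long each one took."""
    timings = {}
    with timed("Feature extraction", timings):
        features = extract(audio)
    with timed("Model inference", timings):
        tokens = infer(features)
    with timed("Decoding", timings):
        text = decode(tokens)
    total = sum(timings.values())
    parts = ", ".join(f"{name}: {secs:.3f}s" for name, secs in timings.items())
    logger.debug("ASR timing - %s, Total: %.3fs", parts, total)
    return text, timings
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring short intervals.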
### 5. Performance Metrics
- **Processing time**: How long each transcription takes
- **Audio-to-text ratio**: Processing time relative to audio duration (values below 1.0 mean transcription runs faster than real time)
- **Queue depth**: Processing backlog monitoring
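
The ratio of processing time to audio duration is simple arithmetic. The helper names below are hypothetical and only illustrate the calculation.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def keeps_up(processing_seconds, audio_seconds):
    """True when transcription finishes faster than the audio plays."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0
```

For the values in the sample output (0.291s of processing for 2.10s of audio), the ratio is roughly 0.14, so transcription comfortably keeps pace with incoming audio.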
## Log Levels
### DEBUG Level
- Individual audio frame details
- Buffer management operations
- Processing queue status
- Detailed timing information
- Audio quality metrics for each chunk
### INFO Level
- Model loading status
- Track connection events
- Completed transcriptions with timing
- Periodic audio frame summaries (every 20 frames)
- Major processing events
### WARNING Level
- Missing audio processor
- Event loop issues
- Queue full conditions
- Non-audio frame reception
### ERROR Level
- Model loading failures
- Transcription errors
- Processing loop crashes
- Track handling exceptions
## Usage
### Enable Debug Logging
```bash
# From the voicebot directory
python set_whisper_debug.py
```
### Return to Normal Logging
```bash
python set_whisper_debug.py info
```
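
The contents of `set_whisper_debug.py` are not reproduced in this guide. As a minimal sketch of what switching levels at runtime involves with Python's standard `logging` module, assuming a hypothetical logger name and a hypothetical `set_level` helper:

```python
import logging

# Hypothetical logger name; the real module's logger may be named differently.
LOGGER_NAME = "whisper_sketch"

LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def set_level(level_name="debug"):
    """Set the logger's threshold; messages below it are discarded."""
    try:
        level = LEVELS[level_name.lower()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(level)
    return logger
```

With the level at `info`, the per-frame DEBUG messages are suppressed while model loading, track events, and completed transcriptions are still logged.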
### Sample Enhanced Log Output
```
INFO - Loading Whisper model: distil-whisper/distil-large-v3
INFO - Whisper processor loaded successfully
INFO - Whisper model loaded and set to evaluation mode
INFO - AudioProcessor initialized - sample_rate: 16000Hz, frame_size: 480, phrase_timeout: 3.0s
INFO - Received audio track from user_123, starting transcription (processor available: True)
DEBUG - Received audio frame from user_123: 48000Hz, s16, stereo
DEBUG - Audio frame data: shape=(1440, 2), dtype=int16
DEBUG - Converted stereo to mono: (1440, 2) -> (1440,)
DEBUG - Normalized int16 audio to float32
DEBUG - Resampled audio: 48000Hz -> 16000Hz, 1440 -> 480 samples
DEBUG - Audio frame #1: RMS: 0.0234, Peak: 0.1892
DEBUG - Added audio chunk: 480 samples, buffer size: 1 frames (30ms)
INFO - Audio frame #20 from user_123: 48000Hz, s16, stereo, 480 samples, RMS: 0.0156, Peak: 0.2103
DEBUG - Buffer threshold reached, queuing for processing
DEBUG - Queuing audio chunk: 4800 samples, 0.30s duration, RMS: 0.0189, Peak: 0.2103
DEBUG - Added to processing queue, queue size: 1
DEBUG - Retrieved audio chunk from queue, remaining queue size: 0
INFO - Starting streaming transcription: 2.10s audio, RMS: 0.0245, Peak: 0.3456
DEBUG - ASR timing - Feature extraction: 0.045s, Model inference: 0.234s, Decoding: 0.012s, Total: 0.291s
INFO - Transcribed (streaming): 'Hello there, how are you doing today?' (processing time: 0.291s, audio duration: 2.10s)
```
## Troubleshooting
### No Transcriptions Appearing
- Check if AudioProcessor is created: Look for "AudioProcessor initialized" message
- Verify audio quality: Look for RMS levels > 0.001 and reasonable peak values
- Check processing queue: Should show "Added to processing queue" messages
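
The RMS check above can be automated against a captured log. The sketch below assumes log lines follow the `RMS: <value>` format shown in the sample output; the function name and regular expression are illustrative, not part of the project.

```python
import re

# Matches the "RMS: 0.0234" fragment used in the sample log output.
RMS_PATTERN = re.compile(r"RMS: (\d+\.\d+)")
SILENCE_THRESHOLD = 0.001  # threshold suggested in this guide


def quiet_lines(log_lines, threshold=SILENCE_THRESHOLD):
    """Return the log lines whose RMS level is at or below the threshold."""
    quiet = []
    for line in log_lines:
        match = RMS_PATTERN.search(line)
        if match and float(match.group(1)) <= threshold:
            quiet.append(line)
    return quiet
```

If most frames end up in the returned list, the input is likely silent or heavily attenuated before it reaches the ASR pipeline.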
### Poor Recognition Quality
- Monitor RMS and peak levels - very low values indicate quiet audio
- Check processing timing - slow inference may indicate resource issues
- Look for resampling messages - frequent resampling can degrade quality
### Performance Issues
- Monitor "ASR timing" logs for slow components
- Check queue depth - high values indicate processing backlog
- Look for "queue full" warnings indicating dropped audio
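
To spot the slow component quickly, the "ASR timing" lines can be parsed. This sketch assumes the line format shown in the sample output; the function name and pattern are hypothetical.

```python
import re

# Matches stage timings such as "Model inference: 0.234s".
STAGE_PATTERN = re.compile(
    r"(Feature extraction|Model inference|Decoding): (\d+\.\d+)s")


def slowest_stage(line):
    """Return (stage name, seconds) for the slowest stage, or None if absent."""
    stages = {name: float(secs) for name, secs in STAGE_PATTERN.findall(line)}
    if not stages:
        return None
    name = max(stages, key=stages.get)
    return name, stages[name]
```

Applied to the "ASR timing" line in the sample output, this returns `("Model inference", 0.234)`.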

Together, these logs expose each stage of the ASR pipeline, from frame reception through buffering to transcription, which helps when diagnosing audio quality issues, performance problems, and configuration errors.