# Whisper ASR Enhanced Logging

This enhancement adds detailed logging to the Whisper ASR system to help debug and monitor speech recognition performance.

## New Logging Features

### 1. Model Loading
- Logs when the Whisper model is being loaded
- Shows which model variant is being used
- Confirms successful processor and model initialization

### 2. Audio Frame Processing
- **Frame-by-frame details**: Sample rate, format, layout, shape, and data type
- **Audio quality metrics**: RMS level and peak amplitude for each frame
- **Format conversions**: Logs when converting stereo to mono, resampling, or normalizing
- **Frame counting**: Full frame details are logged only once every 20 frames to reduce log noise

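
The per-frame metrics and the 20-frame summary described above could be computed roughly as follows. This is a minimal sketch, not the project's actual code: the logger name `whisper_asr_sketch` and the functions `frame_metrics` and `log_frame` are hypothetical, and it assumes samples have already been normalized to floats in [-1.0, 1.0]. It uses plain Python rather than NumPy to stay self-contained.

```python
import logging
import math

logger = logging.getLogger("whisper_asr_sketch")  # hypothetical logger name

LOG_EVERY_N_FRAMES = 20  # the 20-frame interval described above


def frame_metrics(samples):
    """Return (rms, peak) for a sequence of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


def log_frame(frame_index, samples):
    """Log metrics at DEBUG for every frame and a summary at INFO every Nth frame."""
    rms, peak = frame_metrics(samples)
    logger.debug("Audio frame #%d: RMS: %.4f, Peak: %.4f", frame_index, rms, peak)
    if frame_index % LOG_EVERY_N_FRAMES == 0:
        logger.info(
            "Audio frame #%d: %d samples, RMS: %.4f, Peak: %.4f",
            frame_index, len(samples), rms, peak,
        )
    return rms, peak
```

Keeping the per-frame line at DEBUG and the periodic summary at INFO means that normal operation still shows a heartbeat of audio levels without flooding the log.
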
### 3. Audio Buffer Management
- **Buffer status**: Shows buffer size in frames and milliseconds
- **Queue management**: Tracks when audio is queued for processing
- **Audio metrics**: RMS, peak amplitude, and duration for queued chunks
- **Queue size monitoring**: Shows processing queue depth

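
To make the buffer and queue figures concrete, the sketch below shows how a buffer size in frames maps to milliseconds and how queue depth might be logged when a chunk is enqueued. It is an illustration under stated assumptions: the 16000 Hz sample rate and 480-sample frame size are taken from the sample log output in this document, while the logger name, the function names, and the exact wording of the warning message are hypothetical.

```python
import logging
import queue

logger = logging.getLogger("whisper_asr_sketch")  # hypothetical logger name

SAMPLE_RATE = 16000  # Hz, as in the sample log output
FRAME_SIZE = 480     # samples per frame, as in the sample log output


def buffer_duration_ms(num_frames, frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Convert a buffer size in frames to a duration in milliseconds."""
    return num_frames * frame_size * 1000 / sample_rate


def enqueue_chunk(processing_queue, chunk):
    """Queue a chunk without blocking; warn and return False if the queue is full."""
    try:
        processing_queue.put_nowait(chunk)
    except queue.Full:
        # Illustrative message text, not necessarily what the real system logs.
        logger.warning("Processing queue full, dropping %d samples", len(chunk))
        return False
    logger.debug("Added to processing queue, queue size: %d", processing_queue.qsize())
    return True
```

With these values one frame is 30 ms, which agrees with the `buffer size: 1 frames (30ms)` line in the sample output.
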
### 4. ASR Processing Pipeline
- **Processing timing**: Separate timing for feature extraction, model inference, and decoding
- **Audio analysis**: Duration, RMS, and peak levels for audio being transcribed
- **Phrase detection**: Logs when phrases are considered complete
- **Streaming vs final**: Clear distinction between partial and final transcriptions

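
One way to collect separate timings for each stage is a small context manager around `time.perf_counter()`. The sketch below is an assumption about how such timing could be structured, not the actual implementation: `timed`, `run_pipeline`, and the `extract`, `infer`, and `decode` callables are hypothetical stand-ins for the real feature extractor, model, and decoder.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of the enclosed block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def run_pipeline(audio, extract, infer, decode):
    """Run three stages in order and return (text, per-stage timings in seconds)."""
    timings = {}
    with timed("feature_extraction", timings):
        features = extract(audio)
    with timed("model_inference", timings):
        tokens = infer(features)
    with timed("decoding", timings):
        text = decode(tokens)
    # Total is the sum of the three stage timings recorded above.
    timings["total"] = sum(timings.values())
    return text, timings
```

The resulting dictionary can be formatted into a single log line per transcription, similar to the `ASR timing` entry shown in the sample output.
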
### 5. Performance Metrics
- **Processing time**: How long each transcription takes
- **Audio-to-text ratio**: Processing time relative to audio duration
- **Queue depth**: Processing backlog monitoring

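
The ratio of processing time to audio duration is commonly called the real-time factor. As a worked example using the figures from the sample log output in this document (0.291 s of processing for 2.10 s of audio), a hypothetical helper might look like this:

```python
def real_time_factor(processing_time_s, audio_duration_s):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return processing_time_s / audio_duration_s


# Figures taken from the sample log output: roughly 0.14
rtf = real_time_factor(0.291, 2.10)
```

A value below 1.0 means transcription keeps up with incoming audio. A value that stays above 1.0 means the backlog will grow, which should also show up as increasing queue depth.
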
## Log Levels

### DEBUG Level
- Individual audio frame details
- Buffer management operations
- Processing queue status
- Detailed timing information
- Audio quality metrics for each chunk

### INFO Level
- Model loading status
- Track connection events
- Completed transcriptions with timing
- Periodic audio frame summaries (every 20 frames)
- Major processing events

### WARNING Level
- Missing audio processor
- Event loop issues
- Queue full conditions
- Non-audio frame reception

### ERROR Level
- Model loading failures
- Transcription errors
- Processing loop crashes
- Track handling exceptions

## Usage

### Enable Debug Logging
```bash
# From the voicebot directory
python set_whisper_debug.py
```

### Return to Normal Logging
```bash
python set_whisper_debug.py info
```

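
For readers who want to adjust the level from within their own code, the sketch below shows the general idea using the standard `logging` module. It does not reproduce `set_whisper_debug.py`, whose internals are not shown in this document: the logger name `whisper_asr_sketch` and the `set_level` helper are hypothetical, and the real script may apply or persist the level differently.

```python
import logging

LOGGER_NAME = "whisper_asr_sketch"  # hypothetical; use the name the project logs under

LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def set_level(level_name="debug"):
    """Set the named logger's level and return the numeric level that was applied."""
    try:
        level = LEVELS[level_name.lower()]
    except KeyError:
        raise ValueError(f"Unknown log level: {level_name!r}") from None
    logging.getLogger(LOGGER_NAME).setLevel(level)
    return level
```

Calling `set_level()` with no argument selects DEBUG, mirroring how the commands above default to debug output when no level is given.
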
### Sample Enhanced Log Output

```
INFO - Loading Whisper model: distil-whisper/distil-large-v3
INFO - Whisper processor loaded successfully
INFO - Whisper model loaded and set to evaluation mode
INFO - AudioProcessor initialized - sample_rate: 16000Hz, frame_size: 480, phrase_timeout: 3.0s
INFO - Received audio track from user_123, starting transcription (processor available: True)
DEBUG - Received audio frame from user_123: 48000Hz, s16, stereo
DEBUG - Audio frame data: shape=(1440, 2), dtype=int16
DEBUG - Converted stereo to mono: (1440, 2) -> (1440,)
DEBUG - Normalized int16 audio to float32
DEBUG - Resampled audio: 48000Hz -> 16000Hz, 1440 -> 480 samples
DEBUG - Audio frame #1: RMS: 0.0234, Peak: 0.1892
DEBUG - Added audio chunk: 480 samples, buffer size: 1 frames (30ms)
INFO - Audio frame #20 from user_123: 48000Hz, s16, stereo, 480 samples, RMS: 0.0156, Peak: 0.2103
DEBUG - Buffer threshold reached, queuing for processing
DEBUG - Queuing audio chunk: 4800 samples, 0.30s duration, RMS: 0.0189, Peak: 0.2103
DEBUG - Added to processing queue, queue size: 1
DEBUG - Retrieved audio chunk from queue, remaining queue size: 0
INFO - Starting streaming transcription: 2.10s audio, RMS: 0.0245, Peak: 0.3456
DEBUG - ASR timing - Feature extraction: 0.045s, Model inference: 0.234s, Decoding: 0.012s, Total: 0.291s
INFO - Transcribed (streaming): 'Hello there, how are you doing today?' (processing time: 0.291s, audio duration: 2.10s)
```

## Troubleshooting

### No Transcriptions Appearing
- Check if AudioProcessor is created: Look for "AudioProcessor initialized" message
- Verify audio quality: Look for RMS levels > 0.001 and reasonable peak values
- Check processing queue: Should show "Added to processing queue" messages

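
The three checks above can be automated against captured log lines. The following is a rough sketch that assumes the message formats shown in the sample log output; `check_log` and the names of the result keys are hypothetical.

```python
import re

RMS_PATTERN = re.compile(r"RMS: (\d+\.\d+)")
RMS_FLOOR = 0.001  # threshold mentioned in the checklist above


def check_log(lines):
    """Return a dict of booleans, one per checklist item."""
    lines = list(lines)
    rms_values = [
        float(match.group(1))
        for line in lines
        for match in RMS_PATTERN.finditer(line)
    ]
    return {
        "processor_initialized": any("AudioProcessor initialized" in l for l in lines),
        "audio_above_floor": any(v > RMS_FLOOR for v in rms_values),
        "queued_for_processing": any("Added to processing queue" in l for l in lines),
    }
```

Any key that comes back `False` points to the stage where the pipeline is stalling: processor setup, audio input level, or buffering.
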
### Poor Recognition Quality
- Monitor RMS and peak levels - very low values indicate quiet audio
- Check processing timing - slow inference may indicate resource issues
- Look for resampling messages - frequent resampling can degrade quality

### Performance Issues
- Monitor "ASR timing" logs for slow components
- Check queue depth - high values indicate processing backlog
- Look for "queue full" warnings indicating dropped audio

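
To find the slow component quickly, the "ASR timing" lines can be parsed and the largest stage picked out. This sketch assumes the line format shown in the sample log output; `parse_asr_timing` and `slowest_stage` are hypothetical helper names.

```python
import re

TIMING_PATTERN = re.compile(
    r"(Feature extraction|Model inference|Decoding|Total): (\d+\.\d+)s"
)


def parse_asr_timing(line):
    """Extract stage timings in seconds from one 'ASR timing' log line."""
    return {name: float(value) for name, value in TIMING_PATTERN.findall(line)}


def slowest_stage(timings):
    """Return the name of the slowest stage, ignoring 'Total', or None if empty."""
    stages = {name: t for name, t in timings.items() if name != "Total"}
    if not stages:
        return None
    return max(stages, key=stages.get)
```

For the sample line in this document, the slowest stage is model inference at 0.234 s out of 0.291 s total, which is the usual place to look first when transcription falls behind.
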
Together, these logs expose each stage of the ASR pipeline, which makes it easier to diagnose audio quality issues, performance problems, and configuration errors.