Endpointing Model for Speaker Recognition
Overview
This script implements an integrated continuous conversation pipeline composed of three primary submodules:
- VAD (Voice Activity Detection): Uses an ONNX-based model to continuously analyze incoming audio chunks. The raw audio is first normalized and converted into a log-mel spectrogram (with 80 mel bins) before running inference. The VAD determines whether speech is currently present or absent.
- ASR (Automatic Speech Recognition): Utilizes the Azure Speech SDK to transcribe finalized audio segments into text. Audio segments are written to a temporary WAV file and sent to Azure via the `recognize_once()` API.
- Endpointing: Uses a fine-tuned DistilBERT model (loaded via Hugging Face Transformers) to decide whether the recognized text (accumulated from ASR) forms a complete utterance. A rolling context (default 20 words) is maintained and evaluated. The endpointing decision is used to finalize an utterance immediately if it is classified as complete.
In addition, the script incorporates:
- A persistent status line that updates in the terminal to show the current VAD state, endpointing status, utterance duration, and a preview of the recognized text buffer.
- A maximum segment length mechanism so that even during long uninterrupted speech, the system forces an ASR pass every set interval (default 4 seconds).
- Forced endpointing logic (with a guard check) so that if no new text arrives for a specified timeout (1.5 seconds), any pending text is finalized.
- Detailed logging (with timestamps) for major events such as VAD start/end, ASR recognition results, endpointing decisions, and forced timeouts.
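To make the flow concrete, here is a simplified sketch of how these pieces could fit together in the main loop. It reuses the components documented below (`SimpleVAD`, `transcribe_with_azure`, `EndpointingModel`, `write_audio_to_file`, `TEMP_AUDIO_FILE`); the PyAudio settings and buffering details are assumptions, and the max-segment and forced-timeout logic described above is omitted for brevity.

```python
# Simplified main-loop sketch; not the exact implementation.
# SimpleVAD, transcribe_with_azure, EndpointingModel, and write_audio_to_file
# are described in the module sections below.
import pyaudio

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=1600)  # ~100 ms chunks (assumed)

vad = SimpleVAD()
endpointer = EndpointingModel()
pending_chunks, text_buffer = [], ""
in_speech = False

while True:
    chunk = stream.read(1600, exception_on_overflow=False)
    state = vad.update(chunk)  # "start", "continue", or "end"

    if state == "start":
        in_speech = True
    if in_speech:
        pending_chunks.append(chunk)

    if state == "end" and pending_chunks:
        in_speech = False
        write_audio_to_file(pending_chunks, TEMP_AUDIO_FILE, 1, 16000, pyaudio.paInt16)
        text = transcribe_with_azure(TEMP_AUDIO_FILE)
        pending_chunks = []
        if text:
            endpointer.update_context(text)
            text_buffer = (text_buffer + " " + text).strip()
        if endpointer.predict_complete():
            print(f"FINAL: {text_buffer}")
            endpointer.reset_context()
            text_buffer = ""
```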
Key Features
- Real-time processing with minimal latency
- Intelligent endpointing using a fine-tuned DistilBERT model
- Production-ready deployment with NVIDIA Triton Inference Server
- Configurable parameters for different use cases
- Detailed logging for debugging and analysis
- Persistent status display for monitoring system state
Global Configuration
All configuration toggles and parameters are defined at the top of the script. This includes module toggles, audio settings, model paths, thresholds, and timeouts. For example:
- Module Toggles:
  - `USE_VAD`: Enable/disable the VAD module.
  - `USE_ASR`: Enable/disable the ASR module.
  - `USE_ENDPOINTING`: Enable/disable the endpointing module.
- VAD Parameters:
  - `VAD_TIMEOUT_MS`: 800 ms of silence triggers end-of-utterance.
  - `VAD_START_THRESHOLD` and `VAD_END_THRESHOLD`: Thresholds for starting and ending speech detection.
  - `VAD_STEP_SIZE` and `VAD_WINDOW_SIZE`: Duration (in seconds) of each audio chunk and of the rolling buffer.
  - `SPEECH_START_CONFIRM_FRAMES` and `SPEECH_STOP_CONFIRM_FRAMES`: Number of successive frames required to confirm speech start or stop.
- ASR Parameters (Azure):
  - `AZURE_SUBSCRIPTION_KEY` and `AZURE_REGION`: Credentials for Azure Speech Services.
  - `TEMP_AUDIO_FILE`: Temporary file name used to store audio segments.
- Endpointing Parameters:
  - `ENDPOINTING_CONTEXT_SIZE`: How many words are maintained in the rolling context.
  - `ENDPOINTING_PROB_THRESHOLD`: Probability threshold for classifying an utterance as complete.
  - `ENDPOINTING_TIMEOUT_MS`: 1.5 s timeout if no new text is received.
  - `ENDPOINTING_MODEL_DIR`: Directory containing the fine-tuned DistilBERT model.
- Max Segment Length:
  - `MAX_SEGMENT_SECONDS`: Forces an ASR pass if an utterance lasts longer than 4 seconds.
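For illustration, the configuration block at the top of the script might look roughly like the following. The 800 ms, 1.5 s, 4 s, and 20-word values mirror the defaults described above; everything marked as an example (thresholds, chunk sizes, file and directory names) is a placeholder rather than the script's actual setting.

```python
# Illustrative configuration; values marked "example" are placeholders.

# Module toggles
USE_VAD = True
USE_ASR = True
USE_ENDPOINTING = True

# VAD parameters
VAD_TIMEOUT_MS = 800             # silence duration that ends an utterance
VAD_START_THRESHOLD = 0.5        # example value
VAD_END_THRESHOLD = 0.3          # example value
VAD_STEP_SIZE = 0.1              # seconds per audio chunk (example)
VAD_WINDOW_SIZE = 0.5            # seconds in the rolling buffer (example)
SPEECH_START_CONFIRM_FRAMES = 3  # example value
SPEECH_STOP_CONFIRM_FRAMES = 5   # example value

# ASR parameters (Azure)
AZURE_SUBSCRIPTION_KEY = "your_subscription_key"
AZURE_REGION = "your_region"
TEMP_AUDIO_FILE = "temp_segment.wav"  # example file name

# Endpointing parameters
ENDPOINTING_CONTEXT_SIZE = 20        # words kept in the rolling context
ENDPOINTING_PROB_THRESHOLD = 0.5     # example value
ENDPOINTING_TIMEOUT_MS = 1500        # finalize pending text after 1.5 s of no new text
ENDPOINTING_MODEL_DIR = "complete_model"  # assumed from the repository layout

# Segmentation
MAX_SEGMENT_SECONDS = 4              # force an ASR pass every 4 s of continuous speech
```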
Dependencies
- Python 3.7+
- PyAudio: For microphone audio capture.
- NumPy: For array and signal processing.
- Librosa: For computing mel spectrograms.
- ONNX Runtime: For running the VAD model.
- Azure Cognitive Services Speech SDK: For ASR.
- PyTorch: Required by the endpointing model.
- Transformers (Hugging Face): For loading the DistilBERT endpointing model.
- Logging and System Libraries: Standard Python modules for logging and system info.
Module Details
VAD Module
- Class: `SimpleVAD`
- Purpose: Processes incoming audio chunks by updating a rolling buffer, converting raw audio into a log-mel spectrogram, and running ONNX inference to determine the speech probability.
- Key Methods (a sketch of these steps follows this list):
  - `process_chunk(chunk)`:
    - Normalizes the audio.
    - Computes an 80-bin mel spectrogram (using `librosa.feature.melspectrogram`).
    - Converts the spectrogram to log scale.
    - Reshapes the input to `[1, 80, T]` and runs inference through the ONNX model.
    - Returns a speech probability.
  - `update(chunk)`:
    - Calls `process_chunk()` and applies logic to update internal state (using counters for start and stop confirmation).
    - Returns `"start"`, `"end"`, or `"continue"` based on detection.
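A minimal sketch of the `process_chunk` steps, assuming 16 kHz 16-bit audio; the normalization constant, hop settings, and how the ONNX output maps to a probability are assumptions, while the 80 mel bins and the `[1, 80, T]` input shape come from the description above.

```python
import numpy as np
import librosa
import onnxruntime as ort

SAMPLE_RATE = 16000  # assumed
session = ort.InferenceSession("models/vad_multilingual_marblenet.onnx")

def process_chunk(rolling_buffer: np.ndarray) -> float:
    """Return a speech probability for the current rolling audio buffer."""
    # Normalize 16-bit PCM samples to [-1, 1] (normalization scheme assumed).
    audio = rolling_buffer.astype(np.float32) / 32768.0

    # 80-bin mel spectrogram, then log scale.
    mel = librosa.feature.melspectrogram(y=audio, sr=SAMPLE_RATE, n_mels=80)
    log_mel = np.log(mel + 1e-9)

    # Shape [1, 80, T], as the model expects.
    model_input = log_mel[np.newaxis, :, :].astype(np.float32)

    # Tensor names are taken from the session; the exported model may also
    # expect additional inputs (e.g. sequence lengths).
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: model_input})

    # Interpret the output as a speech probability (exact output head is model-specific).
    return float(np.ravel(outputs[0])[-1])
```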
ASR Module
- Function: `transcribe_with_azure(file_path)`
- Purpose: Reads an audio file (written as a temporary WAV file) and sends it to Azure Speech Services via the `recognize_once()` method. Returns the recognized text.
- Logging: Logs whether Azure recognized speech or whether an error occurred.
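A minimal sketch of this call with the Azure Speech SDK; the credentials are read from the environment variables set up in the Installation section, and error handling is simplified to returning an empty string.

```python
import os
import azure.cognitiveservices.speech as speechsdk

AZURE_SUBSCRIPTION_KEY = os.environ["AZURE_SUBSCRIPTION_KEY"]
AZURE_REGION = os.environ["AZURE_REGION"]

def transcribe_with_azure(file_path: str) -> str:
    """Send a finalized WAV segment to Azure and return the recognized text."""
    speech_config = speechsdk.SpeechConfig(
        subscription=AZURE_SUBSCRIPTION_KEY, region=AZURE_REGION)
    audio_config = speechsdk.audio.AudioConfig(filename=file_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)

    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    # NoMatch, Canceled, etc. are treated as "no text" in this sketch.
    return ""
```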
Endpointing Module
- Class: `EndpointingModel`
- Purpose: Maintains a rolling context (up to a set number of words) from recognized text. Uses a fine-tuned DistilBERT model to decide if the current context forms a complete utterance.
- Key Methods (see the sketch after this list):
  - `update_context(new_text)`:
    - Splits incoming text into words and updates the rolling context.
  - `predict_complete()`:
    - Feeds the rolling context to the model and returns `True` if the predicted probability for "complete" exceeds the threshold.
  - `reset_context()`:
    - Clears the context after an utterance is finalized.
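A minimal sketch of `EndpointingModel` using Hugging Face Transformers. It assumes the fine-tuned DistilBERT checkpoint in the model directory is a standard sequence-classification model whose label index 1 means "complete"; that label mapping and the threshold value shown are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ENDPOINTING_MODEL_DIR = "complete_model"  # assumed from the repository layout
ENDPOINTING_CONTEXT_SIZE = 20
ENDPOINTING_PROB_THRESHOLD = 0.5          # example value

class EndpointingModel:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(ENDPOINTING_MODEL_DIR)
        self.model = AutoModelForSequenceClassification.from_pretrained(ENDPOINTING_MODEL_DIR)
        self.model.eval()
        self.context = []  # rolling list of words

    def update_context(self, new_text: str) -> None:
        # Keep only the most recent ENDPOINTING_CONTEXT_SIZE words.
        self.context = (self.context + new_text.split())[-ENDPOINTING_CONTEXT_SIZE:]

    def predict_complete(self) -> bool:
        if not self.context:
            return False
        inputs = self.tokenizer(" ".join(self.context),
                                return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        prob_complete = torch.softmax(logits, dim=-1)[0, 1].item()  # index 1 = "complete" (assumed)
        return prob_complete > ENDPOINTING_PROB_THRESHOLD

    def reset_context(self) -> None:
        self.context = []
```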
Helper Functions
Audio Processing Helpers
- `write_audio_to_file(audio_chunks, file_path, channels, sample_rate, sample_format)`
  Purpose: Writes accumulated raw audio chunks into a WAV file for processing by ASR.
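A minimal sketch of this helper using the standard `wave` module; the sample width is derived from the PyAudio sample format passed in.

```python
import wave
import pyaudio

def write_audio_to_file(audio_chunks, file_path, channels, sample_rate, sample_format):
    """Write accumulated raw audio chunks to a WAV file for ASR."""
    with wave.open(file_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(pyaudio.get_sample_size(sample_format))  # e.g. 2 bytes for paInt16
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(audio_chunks))
```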
Status Display Helpers
- `update_status_line(vad_state, endpoint_state, asr_preview, utterance_duration)`
  Purpose: Prints a persistent status line that updates in place in the terminal.
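A minimal sketch of the in-place status line using a carriage return; the exact layout and field widths are assumptions.

```python
import sys

def update_status_line(vad_state, endpoint_state, asr_preview, utterance_duration):
    """Redraw a single status line in place instead of scrolling the terminal."""
    line = (f"VAD: {vad_state:<7} | Endpoint: {endpoint_state:<10} | "
            f"Dur: {utterance_duration:5.1f}s | Text: {asr_preview[-40:]}")
    sys.stdout.write("\r" + line.ljust(110))  # pad so shorter lines fully overwrite longer ones
    sys.stdout.flush()
```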
Logging Helpers
- `log_timestamp(event)`
  Purpose: Logs an event with a precise timestamp for analysis.
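A minimal sketch using the standard `logging` module; the format string is an assumption.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d | %(message)s",
    datefmt="%H:%M:%S",
)

def log_timestamp(event: str) -> None:
    """Record a pipeline event (VAD start/end, ASR result, endpoint decision, timeout)."""
    logging.info(event)
```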
Installation
Prerequisites
- Python 3.7+
- NVIDIA GPU (recommended for production deployment)
- Docker (for Triton deployment)
Setup
- Clone the repository:
git clone https://github.com/yourusername/endpointing-model.git
cd endpointing-model
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables:
export AZURE_SUBSCRIPTION_KEY="your_subscription_key"
export AZURE_REGION="your_region"
Usage
Running the Pipeline
To run the complete pipeline:
python Conversation.py
The system will:
- Initialize all components
- Start capturing audio from the default microphone
- Process the audio in real-time
- Display a status line showing the current state
- Log all major events with timestamps
Status Display
The system provides a real-time status display showing:
- VAD state (SPEECH/SILENCE)
- Endpointing state (COMPLETE/INCOMPLETE)
- Current utterance duration
- Preview of recognized text
Triton Inference Server Deployment
For production deployment, the endpointing model can be served using NVIDIA Triton Inference Server:
Build the Docker container:
cd triton_deployment
docker build -t endpointing-triton .
Run the Triton server:
./server.sh
Directory Structure
endpointing-model/
├── Conversation.py # Main pipeline implementation
├── models/ # Model files
│ └── vad_multilingual_marblenet.onnx
├── complete_model/ # DistilBERT endpointing model
└── triton_deployment/ # Triton server deployment
├── Dockerfile # Docker configuration
├── server.sh # Server startup script
└── triton_model_repo/ # Model repository
└── endpointing_pipeline/ # Model configuration
├── config.pbtxt # Triton model configuration
└── 1/ # Model version
└── model.py # Python backend implementation
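For reference, the Python backend in `triton_model_repo/endpointing_pipeline/1/model.py` follows Triton's `TritonPythonModel` interface. The sketch below is illustrative only: the tensor names (`TEXT`, `COMPLETE_PROB`), the bundled model path, and the label mapping are assumptions and would have to match the actual `config.pbtxt`.

```python
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TritonPythonModel:
    def initialize(self, args):
        # Load the fine-tuned DistilBERT endpointing model shipped with this model version.
        model_dir = "/models/endpointing_pipeline/1/complete_model"  # path assumed
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is an assumed input name; it must match config.pbtxt.
            text_tensor = pb_utils.get_input_tensor_by_name(request, "TEXT")
            text = text_tensor.as_numpy()[0].decode("utf-8")

            inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = self.model(**inputs).logits
            prob = torch.softmax(logits, dim=-1)[0, 1].item()  # index 1 = "complete" (assumed)

            out = pb_utils.Tensor("COMPLETE_PROB", np.array([prob], dtype=np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```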
Performance Considerations
- Latency: The system is designed for real-time use with minimal latency.
- Resource Usage:
  - VAD: Lightweight, can run on CPU.
  - Endpointing: More resource-intensive, benefits from GPU acceleration.
- Scalability: Triton deployment enables high-throughput serving