Endpointing Model for Speaker Recognition

Python 3.7+ · Hugging Face Transformers · ONNX Runtime · Triton Inference Server

Overview

This script implements an integrated continuous conversation pipeline composed of three primary submodules:

  1. VAD (Voice Activity Detection):
    Uses an ONNX-based model to continuously analyze incoming audio chunks. The raw audio is first normalized and converted into a log-mel spectrogram (with 80 mel bins) before running inference. The VAD determines whether speech is currently present or absent.

  2. ASR (Automatic Speech Recognition):
    Utilizes the Azure Speech SDK to transcribe finalized audio segments into text. Audio segments are written to a temporary WAV file and sent to Azure via the recognize_once() API.

  3. Endpointing:
    Uses a fine-tuned DistilBERT model (loaded via Hugging Face Transformers) to decide whether the recognized text (accumulated from ASR) forms a complete utterance. A rolling context (default 20 words) is maintained and evaluated. The endpointing decision is used to "finalize" an utterance immediately if it is classified as complete.

In addition, the script incorporates:

  • A persistent status line that updates in the terminal to show the current VAD state, endpointing status, utterance duration, and a preview of the recognized text buffer.
  • A maximum segment length mechanism: even during long uninterrupted speech, the system forces an ASR pass at a fixed interval (default 4 seconds).
  • Forced endpointing logic (with a guard check): if no new text arrives within a specified timeout (default 1.5 seconds), any pending text is finalized. The main-loop sketch after this list illustrates both mechanisms.
  • Detailed logging (with timestamps) for major events such as VAD start/end, ASR recognition results, endpointing decisions, and forced timeouts.
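
The skeleton below shows how these pieces fit together in the main loop. It is a condensed sketch, not the verbatim implementation: names such as vad, endpointer, finalize_utterance(), has_context(), and CHUNK_SIZE are assumptions standing in for the classes, helpers, and constants described under Module Details and Global Configuration below.

import time

segment_chunks = []
utterance_start = None
last_text_time = time.time()

while True:
    # `stream` is a PyAudio input stream opened during initialization (assumed).
    chunk = stream.read(CHUNK_SIZE, exception_on_overflow=False)
    state = vad.update(chunk)  # "start", "continue", or "end"

    if state == "start":
        utterance_start = time.time()
    if utterance_start is not None:
        segment_chunks.append(chunk)

    # Max segment length: force an ASR pass during long uninterrupted speech.
    force_asr = (utterance_start is not None
                 and time.time() - utterance_start >= MAX_SEGMENT_SECONDS)

    if state == "end" or force_asr:
        write_audio_to_file(segment_chunks, TEMP_AUDIO_FILE,
                            CHANNELS, SAMPLE_RATE, SAMPLE_FORMAT)
        segment_chunks = []
        utterance_start = time.time() if force_asr else None
        text = transcribe_with_azure(TEMP_AUDIO_FILE)
        if text:
            endpointer.update_context(text)
            last_text_time = time.time()
            if endpointer.predict_complete():    # classified as complete
                finalize_utterance(endpointer)   # hypothetical: log + reset_context()

    # Forced endpointing: finalize pending text once the timeout elapses.
    if (endpointer.has_context()  # hypothetical guard check
            and time.time() - last_text_time > ENDPOINTING_TIMEOUT_MS / 1000.0):
        finalize_utterance(endpointer)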

Key Features

  • Real-time processing with minimal latency
  • Intelligent endpointing using a fine-tuned DistilBERT model
  • Production-ready deployment with NVIDIA Triton Inference Server
  • Configurable parameters for different use cases
  • Detailed logging for debugging and analysis
  • Persistent status display for monitoring system state

Global Configuration

All configuration toggles and parameters are defined at the top of the script, including module toggles, audio settings, model paths, thresholds, and timeouts. The main groups are listed below, followed by a consolidated sketch of the constants:

  • Module Toggles:

    • USE_VAD: Enable/disable VAD module.
    • USE_ASR: Enable/disable ASR module.
    • USE_ENDPOINTING: Enable/disable endpointing module.
  • VAD Parameters:

    • VAD_TIMEOUT_MS: 800 ms of silence triggers end-of-utterance.
    • VAD_START_THRESHOLD & VAD_END_THRESHOLD: Thresholds for starting and ending speech detection.
    • VAD_STEP_SIZE & VAD_WINDOW_SIZE: Durations (in seconds) of each audio chunk and of the rolling buffer, respectively.
    • SPEECH_START_CONFIRM_FRAMES and SPEECH_STOP_CONFIRM_FRAMES: Number of successive frames required to confirm start or stop.
  • ASR Parameters (Azure):

    • AZURE_SUBSCRIPTION_KEY and AZURE_REGION: Credentials for Azure Speech Services.
    • TEMP_AUDIO_FILE: Temporary file name used to store audio segments.
  • Endpointing Parameters:

    • ENDPOINTING_CONTEXT_SIZE: How many words are maintained in the rolling context.
    • ENDPOINTING_PROB_THRESHOLD: Probability threshold for classifying an utterance as complete.
    • ENDPOINTING_TIMEOUT_MS: 1.5 s timeout if no new text is received.
    • ENDPOINTING_MODEL_DIR: Directory containing the fine-tuned DistilBERT model.
  • Max Segment Length:

    • MAX_SEGMENT_SECONDS: Forces an ASR pass if an utterance lasts longer than 4 seconds.
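
A representative sketch of that configuration block, using the defaults quoted above. Values marked illustrative are not specified in this README and exist only to make the sketch concrete.

import os

# Module toggles
USE_VAD = True
USE_ASR = True
USE_ENDPOINTING = True

# VAD parameters
VAD_TIMEOUT_MS = 800              # silence that ends an utterance
VAD_START_THRESHOLD = 0.6         # illustrative
VAD_END_THRESHOLD = 0.4           # illustrative
VAD_STEP_SIZE = 0.08              # seconds per audio chunk (illustrative)
VAD_WINDOW_SIZE = 0.64            # seconds in the rolling buffer (illustrative)
SPEECH_START_CONFIRM_FRAMES = 2   # illustrative
SPEECH_STOP_CONFIRM_FRAMES = 4    # illustrative

# ASR parameters (Azure)
AZURE_SUBSCRIPTION_KEY = os.environ.get("AZURE_SUBSCRIPTION_KEY", "")
AZURE_REGION = os.environ.get("AZURE_REGION", "")
TEMP_AUDIO_FILE = "temp_segment.wav"   # illustrative name

# Endpointing parameters
ENDPOINTING_CONTEXT_SIZE = 20          # words of rolling context
ENDPOINTING_PROB_THRESHOLD = 0.5       # illustrative
ENDPOINTING_TIMEOUT_MS = 1500
ENDPOINTING_MODEL_DIR = "complete_model"

# Max segment length
MAX_SEGMENT_SECONDS = 4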

Dependencies

  • Python 3.7+
  • PyAudio: For microphone audio capture.
  • NumPy: For array and signal processing.
  • Librosa: For computing mel spectrograms.
  • ONNX Runtime: For running the VAD model.
  • Azure Cognitive Services Speech SDK: For ASR.
  • PyTorch: Required by the endpointing model.
  • Transformers (Hugging Face): For loading the DistilBERT endpointing model.
  • Logging and System Libraries: Standard Python modules for logging and system info.

Module Details

VAD Module

  • Class: SimpleVAD
  • Purpose:
    Processes incoming audio chunks by updating a rolling buffer, converting raw audio into a log-mel spectrogram, and running ONNX inference to determine the speech probability.
  • Key Methods:
    • process_chunk(chunk):
      • Normalizes the audio.
      • Computes an 80-bin mel spectrogram (using librosa.feature.melspectrogram).
      • Converts the spectrogram to log scale.
      • Reshapes the input to [1, 80, T] and runs inference through the ONNX model.
      • Returns a speech probability.
    • update(chunk):
      • Calls process_chunk() and applies logic to update internal state (using counters for start and stop confirmation).
      • Returns "start", "end", or "continue" based on detection.

ASR Module

  • Function: transcribe_with_azure(file_path)
  • Purpose:
    Reads an audio file (written as a temporary WAV file) and sends it to Azure Speech Services via the recognize_once() method. Returns the recognized text.
  • Logging:
    Logs whether Azure recognized speech or if an error occurred.
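
A sketch of this function using the standard Azure Speech SDK calls; error handling is abbreviated.

import logging
import azure.cognitiveservices.speech as speechsdk

def transcribe_with_azure(file_path: str) -> str:
    speech_config = speechsdk.SpeechConfig(
        subscription=AZURE_SUBSCRIPTION_KEY, region=AZURE_REGION)
    audio_config = speechsdk.audio.AudioConfig(filename=file_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        logging.info("Azure recognized: %s", result.text)
        return result.text
    logging.warning("Azure returned no text (reason=%s)", result.reason)
    return ""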

Endpointing Module

  • Class: EndpointingModel
  • Purpose:
    Maintains a rolling context (up to a set number of words) from recognized text. Uses a fine-tuned DistilBERT model to decide if the current context forms a complete utterance.
  • Key Methods:
    • update_context(new_text):
      • Splits incoming text into words and updates the rolling context.
    • predict_complete():
      • Feeds the rolling context to the model and returns True if the predicted probability for "complete" exceeds the threshold.
    • reset_context():
      • Clears the context after an utterance is finalized.
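
A sketch of the class under two assumptions: label index 1 means "complete", and the fine-tuned model loads through the standard Transformers auto classes.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class EndpointingModel:
    def __init__(self, model_dir, context_size=ENDPOINTING_CONTEXT_SIZE,
                 threshold=ENDPOINTING_PROB_THRESHOLD):
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.eval()
        self.context = []
        self.context_size = context_size
        self.threshold = threshold

    def update_context(self, new_text):
        # Keep only the most recent context_size words.
        self.context = (self.context + new_text.split())[-self.context_size:]

    def predict_complete(self):
        inputs = self.tokenizer(" ".join(self.context),
                                return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        prob_complete = torch.softmax(logits, dim=-1)[0, 1].item()
        return prob_complete >= self.threshold

    def reset_context(self):
        self.context = []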

Helper Functions

Audio Processing Helpers

  • write_audio_to_file(audio_chunks, file_path, channels, sample_rate, sample_format)
    Purpose: Writes accumulated raw audio chunks into a WAV file for processing by ASR.
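
A sketch using the standard-library wave module, assuming sample_format is a PyAudio constant such as pyaudio.paInt16:

import wave
import pyaudio

def write_audio_to_file(audio_chunks, file_path, channels,
                        sample_rate, sample_format):
    with wave.open(file_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(pyaudio.get_sample_size(sample_format))  # bytes per sample
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(audio_chunks))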

Status Display Helpers

  • update_status_line(vad_state, endpoint_state, asr_preview, utterance_duration)
    Purpose: Prints a persistent status line that updates in place in the terminal.
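
A sketch of the in-place update, which relies on a carriage return to rewrite the same terminal line on every call; the field layout is illustrative.

import sys

def update_status_line(vad_state, endpoint_state, asr_preview, utterance_duration):
    line = (f"VAD:{vad_state:<8} EP:{endpoint_state:<11} "
            f"{utterance_duration:5.1f}s | {asr_preview[:40]}")
    sys.stdout.write("\r" + line.ljust(100))   # pad to overwrite stale characters
    sys.stdout.flush()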

Logging Helpers

  • log_timestamp(event)
    Purpose: Logs an event with a precise timestamp for analysis.
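
A sketch with millisecond precision, which is what makes it practical to reconstruct the VAD/ASR/endpointing timeline from the log file:

import logging
from datetime import datetime

def log_timestamp(event):
    stamp = datetime.now().strftime("%H:%M:%S.%f")[:-3]   # millisecond precision
    logging.info("[%s] %s", stamp, event)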

Installation

Prerequisites

  • Python 3.7+
  • NVIDIA GPU (recommended for production deployment)
  • Docker (for Triton deployment)

Setup

  1. Clone the repository:
    git clone https://github.com/yourusername/endpointing-model.git
    cd endpointing-model
  2. Install dependencies:
    pip install -r requirements.txt
  3. Configure environment variables:
    export AZURE_SUBSCRIPTION_KEY="your_subscription_key"
    export AZURE_REGION="your_region"

Usage

Running the Pipeline

To run the complete pipeline:

python Conversation.py

The system will:

  • Initialize all components
  • Start capturing audio from the default microphone
  • Process the audio in real-time
  • Display a status line showing the current state
  • Log all major events with timestamps

Status Display

The system provides a real-time status display showing:

  • VAD state (SPEECH/SILENCE)
  • Endpointing state (COMPLETE/INCOMPLETE)
  • Current utterance duration
  • Preview of recognized text

Triton Inference Server Deployment

For production deployment, the endpointing model can be served using NVIDIA Triton Inference Server:

Build the Docker container:

cd triton_deployment
docker build -t endpointing-triton .

Run the Triton server:

./server.sh

Directory Structure

endpointing-model/
├── Conversation.py           # Main pipeline implementation
├── models/                   # Model files
│   └── vad_multilingual_marblenet.onnx
├── complete_model/           # DistilBERT endpointing model
└── triton_deployment/        # Triton server deployment
    ├── Dockerfile           # Docker configuration
    ├── server.sh            # Server startup script
    └── triton_model_repo/   # Model repository
        └── endpointing_pipeline/  # Model configuration
            ├── config.pbtxt      # Triton model configuration
            └── 1/                # Model version
                └── model.py      # Python backend implementation
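
For orientation, a minimal sketch of what model.py in the Python backend typically contains; the tensor names ("TEXT", "COMPLETE_PROB") and model path are assumptions and must match the definitions in config.pbtxt.

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TritonPythonModel:
    def initialize(self, args):
        model_dir = "/models/complete_model"   # illustrative path
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            text = raw[0].decode("utf-8")
            inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = torch.softmax(self.model(**inputs).logits, dim=-1)
            out = pb_utils.Tensor("COMPLETE_PROB",
                                  probs.numpy().astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses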

Performance Considerations

  • Latency: The system is designed for real-time use with minimal latency.
  • Resource Usage:
    • VAD: Lightweight, can run on CPU
    • Endpointing: More resource-intensive, benefits from GPU acceleration
  • Scalability: Triton deployment enables high-throughput serving