Voice Activity Detection and utterance controls

Voice Activity Detection (VAD) is the process the Voci ASR engine uses to distinguish human speech from silence.

VAD processing enables parallelization during ASR transcription, even within a single audio stream or phone call, by breaking audio data into discrete chunks of speech called utterances.

This processing yields high efficiency and reduced latency because the ASR engine transcribes only the audio in which speech is detected.

Voci ASR can use either of two types of VAD: energy or level.

Note:

When diarization is enabled, the ASR engine uses an independent algorithm to detect utterance breaks simultaneously with speaker changes.

Energy VAD

Energy VAD accumulates audio signal energy over a window of time to determine the presence of speech.

Energy VAD is the default processor and performs well for most use cases. It adapts well to a variety of speakers and phone signal variations, even within the same call.

Energy VAD is the best type for post-call transcription because it accounts for and detects speech at different volumes, and because the latency incurred and buffering required are inconsequential for post-call processing.

In general, energy VAD is the best choice for most use cases; the main exception is processing audio in real time.

Level VAD

Level VAD uses an algorithm that detects speech from the raw amplitude level of the audio signal. Amplitude is perceived by human ears as volume or loudness.

Level VAD uses default or request-specified parameters to define a buffer window around the audio's amplitude peaks, ensuring that complementary parts of the signal are included in the utterances.

Because its maximum buffer and window sizes are directly configurable, level VAD can be useful for real-time ASR applications that must meet strict latency requirements.

VAD parameters

Some VAD parameter default values are specified by the ASR engine, and some are specified by the language model used to process the transcription request. In all cases, VAD parameters specified with a transcription request override the configured defaults, whether they come from the engine or from the language model. The parameters below can be specified as stream tags with a transcription request.
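
For illustration, the hedged sketch below shows one way to pass these stream tags with a post-call transcription request over HTTP. The host, port, and /transcribe path are assumptions modeled on a V-Blaze-style REST endpoint; consult the API reference for your deployment:

    import requests

    # Assumed V-Blaze-style REST endpoint; adjust host and port for your deployment.
    URL = "http://vblaze-host:17171/transcribe"

    # VAD parameters travel as ordinary form fields (stream tags) alongside the
    # audio file. Tags sent with the request override engine and language model
    # defaults.
    tags = {
        "vadtype": "level",      # use level VAD instead of the default energy VAD
        "activitylevel": "300",  # raise the amplitude threshold above the default 175
    }

    with open("call.wav", "rb") as audio:
        response = requests.post(URL, data=tags, files={"file": audio})

    print(response.json())  # assumes the service returns a JSON transcript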

vadtype

VAD type — Energy, level

Value type — string

Values — energy, level

Default — energy

Description — Specifies the type of VAD to use for the request.

activitylevel

VAD type — Level

Value type — integer

Values — 0 to 32768

Default — 175

Description — Specifies the amplitude threshold that separates active audio from inactive audio. This value should be high enough to screen out noise but low enough to trigger reliably on speech. The range 0-32768 corresponds to the full range of magnitudes representable by a signed 16-bit LPCM sample.
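
As a rough aid for choosing activitylevel, the hedged sketch below scans headerless signed 16-bit LPCM audio and reports its peak and median magnitudes. The file name is illustrative, and the idea of placing the threshold between the noise floor and the speech peaks follows the guidance above rather than any engine internals:

    import numpy as np

    # Read headerless signed 16-bit little-endian LPCM audio (illustrative file name).
    samples = np.frombuffer(open("call.raw", "rb").read(), dtype="<i2")

    # Widen to int32 before abs() so that abs(-32768) does not overflow int16.
    magnitudes = np.abs(samples.astype(np.int32))

    print("peak magnitude:  ", magnitudes.max())            # loudest sample, at most 32768
    print("median magnitude:", int(np.median(magnitudes)))  # rough noise-floor estimate

    # A workable activitylevel sits between these two figures: above the noise
    # floor, but well below the speech peaks.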

uttmaxgap

VAD type — Energy, level

Value type — positive float (seconds)

Default — 2 (seconds)

Description — Any value greater than 0 causes utterances to be held and buffered so that potential substitutions, punctuation, and number translation (numtrans) processing can be applied. Increase uttmaxgap to include more context in a single utterance.

When set to 0, utterances are released immediately after they are processed, which is typically used for real-time deployments.

Tip:

During real-time speech processing, uttmaxgap must be set to 0. Otherwise, utterances are delayed.
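
For example, a real-time request might pair uttmaxgap set to 0 with level VAD. A minimal sketch of the tags, assuming the same request mechanism as the earlier example:

    # Illustrative real-time tags; field names match the stream tags described here.
    realtime_tags = {
        "vadtype": "level",  # configurable buffering helps meet strict latency targets
        "uttmaxgap": "0",    # release each utterance as soon as it is processed
    }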

uttmaxsilence

VAD type — Level

Value type — integer

Values — 100 to 32768

Default — 500 (milliseconds)

Description — Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, the utterance is terminated at the detected silent region.

Decreasing uttmaxsilence increases the number of utterance breaks, which can reduce accuracy.
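
To illustrate the effect described above (and not the engine's actual algorithm), the sketch below splits a stream of 10-millisecond activity flags into utterances whenever a silent run exceeds uttmaxsilence:

    # Illustrative only: mimics the documented behavior of uttmaxsilence, not
    # the engine's implementation. Each flag covers FRAME_MS of audio.
    FRAME_MS = 10

    def split_utterances(active_flags, uttmaxsilence=500):
        utterances, current, silence_ms = [], [], 0
        for i, active in enumerate(active_flags):
            if active:
                current.append(i)
                silence_ms = 0
            elif current:
                silence_ms += FRAME_MS
                if silence_ms > uttmaxsilence:  # gap too long: terminate utterance
                    utterances.append((current[0], current[-1]))
                    current, silence_ms = [], 0
        if current:
            utterances.append((current[0], current[-1]))
        # Convert frame indices to millisecond offsets.
        return [(start * FRAME_MS, end * FRAME_MS) for start, end in utterances]

    # 600 ms of speech, a 700 ms gap, then 300 ms of speech -> two utterances.
    flags = [1] * 60 + [0] * 70 + [1] * 30
    print(split_utterances(flags))  # [(0, 590), (1300, 1590)]

Lowering uttmaxsilence in this sketch splits the same audio into more, shorter utterances, which mirrors the accuracy trade-off noted above.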

uttmaxtime

VAD type — Energy, level

Value type — integer

Values — 1 to 150 (seconds)

Default — set to 90 in most models; if not specified, the engine default of 80 is used

Description — Specifies the maximum time in seconds allotted for an utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.

Human utterances are typically 5-20 seconds long, much shorter than the uttmaxtime default. As a result, uttmaxtime rarely requires modification.

Adjusting this parameter may be beneficial for transcribing monologues or speeches with unusually long unbroken utterances, and for real-time deployments with aggressive turnaround-time requirements.

Generally, reducing uttmaxtime reduces both speed and accuracy. Specifying a value of less than 20 is not recommended.

Note:

Relying on uttmaxtime to terminate an utterance is not recommended because doing so risks terminating the utterance in the middle of a word, leaving a portion of the word in one utterance and the remainder in another. Word fragments generally do not resemble their full words closely enough for accurate transcription, and shorter utterances provide less context for error reduction.

uttminactivity

VAD type — Energy*, level

Value type — integer

Values — 10 to 32768

Default — 250 (milliseconds)

Description — Specifies, in milliseconds, how much activity is needed to classify a region of audio as an utterance. This is primarily a level VAD setting; for level VAD, the measured activity includes uttpadding, while for energy VAD it does not. Utterances shorter than the value defined for uttminactivity are discarded. The minimum activity required is usually lower when activitylevel or uttpadding is high, and higher when they are low.

* Typically and by default, uttminactivity is not a factor for energy VAD.

uttpadding

VAD type — Level

Value type — integer

Values — 0 to 32768

Default — 250 (milliseconds)

Description — Specifies, in milliseconds, how much padding around a detected active area to treat as active. Typically, the higher the activitylevel, the more padding is needed; lower activity levels require less padding.
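
Bringing the level VAD parameters together, the hedged sketch below shows one plausible tuning for noisy audio, reusing the endpoint assumed in the earlier example. The specific values are illustrative and follow the relationships described above; they are not recommendations:

    import requests

    # Illustrative level VAD tuning for noisy audio; values are examples only.
    tags = {
        "vadtype": "level",
        "activitylevel": "400",   # higher threshold to screen out background noise
        "uttpadding": "350",      # a higher activitylevel calls for more padding
        "uttminactivity": "200",  # high activitylevel and uttpadding allow a lower minimum
        "uttmaxsilence": "600",   # tolerate slightly longer pauses before splitting
    }

    with open("noisy-call.wav", "rb") as audio:
        r = requests.post("http://vblaze-host:17171/transcribe",  # assumed endpoint
                          data=tags, files={"file": audio})
    print(r.json())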