Voice Activity Detection and utterance controls
Voice Activity Detection (VAD) is the Voci ASR engine's process for distinguishing human speech from silence.
VAD processing enables parallelization during ASR transcription, even within a single audio stream or phone call, by breaking audio data into discrete chunks of speech called utterances.
This processing results in a high level of efficiency and reduced latency, because the ASR engine transcribes audio only for data in which speech is detected.
Voci ASR can use either of two types of VAD: energy or level.
When diarization is enabled, the ASR engine uses an independent algorithm to detect utterance breaks simultaneously with speaker changes.
Energy VAD
Energy VAD accumulates audio signal energy over a window of time to determine the presence of speech.
Energy VAD is the default processor and performs well for most use cases. It adapts well to a variety of speakers and phone signal variations, even within the same call.
Energy VAD is the best type for transcribing audio post-call because it accounts for and detects speech at different volumes, and because the latency incurred and buffering required are inconsequential for post-call processing. The only use case where energy VAD is not the best choice is processing audio in real time.
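The engine's internal energy VAD algorithm is not documented here, but the general idea of accumulating signal energy over a window and comparing it against a threshold can be sketched as follows. The function names, frame size, and threshold value are illustrative assumptions, not part of the Voci API:

```python
# Illustrative sketch only: accumulate energy per frame of 16-bit LPCM
# samples and flag frames whose energy exceeds a threshold as speech.
# The threshold value here is an assumption for the example.

def frame_energy(samples):
    """Mean squared amplitude of one frame of LPCM samples."""
    return sum(s * s for s in samples) / len(samples)

def energy_vad(frames, threshold=1000.0):
    """Return a True/False speech decision per frame."""
    return [frame_energy(f) > threshold for f in frames]

speech = [900, -1100, 1000, -950]   # loud frame
silence = [3, -2, 4, -1]            # near-silent frame
print(energy_vad([silence, speech, silence]))  # [False, True, False]
```

Because the decision is based on accumulated energy rather than instantaneous amplitude, this style of detector adapts more gracefully to speakers at different volumes, which matches the behavior described above.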
Level VAD
Level VAD uses an algorithm that detects speech from the raw amplitude level of the audio signal. Amplitude is perceived by human ears as volume or loudness.
Level VAD uses default or request-specified parameters to define a buffer window around the audio's amplitude peaks, ensuring that complementary parts of the audio signal are included in the utterances.
Level VAD can be useful for real-time ASR applications where strict latency requirements must be met because the maximum buffer and window sizes are directly configurable.
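The engine's actual level VAD algorithm is likewise not public; the toy sketch below shows the core idea of thresholding raw amplitude and then padding a window around the peaks so that neighboring low-amplitude samples stay inside the utterance. All names and values are illustrative assumptions:

```python
def level_vad(samples, activity_level=175, padding=2):
    """Mark samples active when |amplitude| exceeds activity_level,
    then extend each active region by `padding` samples on both sides."""
    active = [abs(s) > activity_level for s in samples]
    padded = list(active)
    for i, is_active in enumerate(active):
        if is_active:
            lo = max(0, i - padding)
            hi = min(len(active), i + padding + 1)
            for j in range(lo, hi):
                padded[j] = True
    return padded

# A lone peak at index 3 is padded to cover indexes 1 through 5.
print(level_vad([10, 20, 30, 500, 40, 30, 10]))
# [False, True, True, True, True, True, False]
```

Because both the threshold and the padding window are directly configurable, the maximum buffering this approach requires is bounded and predictable, which is why level VAD suits strict real-time latency budgets.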
VAD parameters
Some VAD parameter default values are specified by the ASR engine, and some are specified by the language model used to process the transcription request. In any case, VAD parameters specified with a transcription request override all configured defaults, whether from the engine or the language model. The parameters below may be specified as stream tags with a transcription request.
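The override order described above can be sketched as a simple merge, where request tags win over language-model defaults, which in turn win over engine defaults. The dictionary and function names are assumptions for illustration; the values mirror the defaults described on this page (engine uttmaxtime default 80, model default 90, vadtype default energy):

```python
# Illustrative sketch of VAD parameter precedence: request tags override
# language-model defaults, which override engine defaults.
ENGINE_DEFAULTS = {"vadtype": "energy", "uttmaxtime": 80}
MODEL_DEFAULTS = {"uttmaxtime": 90}  # e.g. set by the language model in use

def effective_vad_params(request_tags):
    params = dict(ENGINE_DEFAULTS)   # lowest precedence
    params.update(MODEL_DEFAULTS)    # model defaults override engine defaults
    params.update(request_tags)      # request tags override everything
    return params

print(effective_vad_params({"vadtype": "level", "activitylevel": 200}))
# {'vadtype': 'level', 'uttmaxtime': 90, 'activitylevel': 200}
```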
- vadtype
  - VAD type — Energy, level
  - Value type — string
  - Values — energy, level
  - Default — energy
  - Description — Specifies the type of VAD to use for the request.
- activitylevel
  - VAD type — Level
  - Value type — integer
  - Values — 0 to 32768, corresponding to the full magnitude range representable by a signed 16-bit LPCM sample
  - Default — 175
  - Description — Specifies the amplitude threshold that separates active audio from inactive audio. This value should be high enough to screen out noise, but low enough to trigger reliably on speech.
- uttmaxgap
  - VAD type — Energy, level
  - Value type — positive float (seconds)
  - Default — 2
  - Description — Any value greater than 0 causes utterances to be held and buffered so that potential substitutions, punctuation, or numtrans processing can be applied. Increase uttmaxgap to include more context in a single utterance. When set to 0, utterances are released immediately after they are processed, which is typical for real-time deployments.

    Tip: During real-time speech processing, uttmaxgap must be set to 0. Otherwise, utterances are delayed.
- uttmaxsilence
  - VAD type — Level
  - Value type — integer
  - Values — 100 to 32768
  - Default — 500 (milliseconds)
  - Description — Specifies the maximum amount of silence, in milliseconds, that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, the utterance is terminated at the detected silent region. Decreasing uttmaxsilence increases the number of utterance breaks, which can reduce accuracy.
- uttmaxtime
  - VAD type — Energy, level
  - Value type — integer
  - Values — 1 to 150 (seconds)
  - Default — set to 90 in most models; if not specified, the engine default of 80 is used
  - Description — Specifies the maximum time in seconds allotted for an utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered before uttmaxtime is reached, the utterance is terminated forcibly. Human utterances are typically 5-20 seconds long, much shorter than the uttmaxtime default, so uttmaxtime rarely requires modification. Adjusting this parameter may be beneficial for transcribing monologues or speeches with unusually long unbroken utterances, and for real-time deployments with aggressive turnaround-time requirements. Generally, reducing uttmaxtime reduces both speed and accuracy. Specifying a value of less than 20 is not recommended.

    Note: Relying on uttmaxtime to terminate an utterance is not recommended because doing so risks terminating the utterance in the middle of a word, leaving a portion of the word in one utterance and the remainder in another. Word fragments generally do not resemble their full words closely enough for accurate transcription, and shorter utterances provide less context for error reduction.
- uttminactivity
  - VAD type — Energy*, level
  - Value type — integer
  - Values — 10 to 32768
  - Default — 250 (milliseconds)
  - Description — Specifies the minimum amount of activity, in milliseconds, required to classify audio as an utterance. This is mostly a level VAD setting; for level VAD, the measured activity includes uttpadding, while for energy VAD it does not. Utterances shorter than the value defined for uttminactivity are discarded. The minimum activity required is usually lower if activitylevel or uttpadding are high, and higher if they are low.

    * Typically and by default, uttminactivity is not a factor for energy VAD.
- uttpadding
  - VAD type — Level
  - Value type — integer
  - Values — 0 to 32768
  - Default — 250 (milliseconds)
  - Description — Specifies how much padding around the active area to treat as active. Typically, the higher the activitylevel, the more padding is needed; lower activity levels require less padding.