Voice Activity Detection and utterance controls

The following parameters are most often used in real-time transcription scenarios using V‑Blaze.

Table 1. Voice Activity Detection controls

activitylevel

Values: integer (default: 175)

Description: Specifies the volume threshold for active versus inactive audio. This value should be high enough to screen out noise, but low enough to trigger reliably on speech. The range is 0-32768, corresponding to the average magnitude of a signed 16-bit LPCM frame.
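To illustrate the threshold comparison, the following sketch (not engine code; the frame handling is a simplification) computes the average magnitude of a signed 16-bit LPCM frame and tests it against the default activitylevel:

```python
# Illustrative sketch only -- not V-Blaze internals.
# Computes the average magnitude of a frame of signed 16-bit samples
# and compares it to the activitylevel threshold (default 175).

def average_magnitude(frame):
    """Mean absolute amplitude of a frame of signed 16-bit samples."""
    return sum(abs(s) for s in frame) / len(frame)

def is_active(frame, activitylevel=175):
    return average_magnitude(frame) > activitylevel

quiet = [3, -2, 5, -4]           # near-silent samples
loud = [1200, -900, 1500, -800]  # speech-level samples

print(is_active(quiet))  # False: average magnitude 3.5 is below 175
print(is_active(loud))   # True: average magnitude 1100.0 exceeds 175
```

A threshold that screens out the quiet frame but fires on the loud one is the tuning goal described above.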

uttmaxgap

Values: integer

Description: Specifies the maximum gap in seconds that can occur between utterances before they are combined. During text processing, each utterance is buffered for a maximum of uttmaxgap seconds, which controls whether subsequent utterances are considered for possible combination during text-processing modifications such as numtrans and substitutions.

Tip: During real-time speech processing, set uttmaxgap to 0. Any other value allows utterances to be buffered for possible combination during text processing, which increases utterance latency. With uttmaxgap set to 0, utterances are never combined during text processing.
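The sketch below shows one way such real-time settings might be assembled into a request query string. The host and endpoint are placeholders, not a documented V-Blaze URL; only the parameter names come from this section:

```python
# Illustrative sketch -- the host and endpoint below are placeholders,
# not a documented V-Blaze URL. It only shows composing a query string
# with real-time-friendly settings (uttmaxgap=0, vadtype=level).
from urllib.parse import urlencode

params = {
    "uttmaxgap": 0,      # real-time: do not buffer utterances for combination
    "vadtype": "level",  # amplitude-based VAD suits live streams
}
query = urlencode(params)
url = f"wss://example.example/transcribe?{query}"
print(url)
```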

uttmaxsilence

Values: integer (default: 800 ms)

Description: Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.

Refer to the uttmaxsilence section below for more information on this parameter.

uttmaxtime

Values: integer (default: 150 seconds)

Description: Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.

uttminactivity

Values: integer (default: 500 ms)

Description: Specifies how much activity is needed (without uttpadding) to classify a region as an utterance. This value is usually lower if activitylevel or uttpadding is high, and vice versa.

uttpadding

Values: integer (default: 300 ms)

Description: Specifies how much padding around the active area to treat as active. Typically, the higher the activitylevel, the more padding is needed; lower activity levels require less padding.
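The interaction between uttminactivity and uttpadding can be sketched as follows. This is a simplified model, not the engine's actual algorithm: regions of activity shorter than uttminactivity are discarded, and surviving regions are extended by uttpadding on each side:

```python
# Simplified sketch -- not V-Blaze internals.
# A detected active region (start_ms, end_ms) is kept only if it lasts
# at least uttminactivity, then extended by uttpadding on each side.

def pad_regions(regions, uttminactivity=500, uttpadding=300):
    kept = []
    for start_ms, end_ms in regions:
        if end_ms - start_ms >= uttminactivity:
            kept.append((max(0, start_ms - uttpadding), end_ms + uttpadding))
    return kept

regions = [(1000, 1200), (2000, 3000)]  # 200 ms blip, 1000 ms of speech
print(pad_regions(regions))  # [(1700, 3300)]: blip dropped, speech padded
```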

vadtype

Values: energy (default), level

Description: The two types of Voice Activity Detection (VAD) available during transcription are energy and level. The energy setting instructs the engine to use the amount of energy in the audio signal to determine whether speech might be present. This is the best setting for transcribing audio files (post-call or batch transcription).

The level setting instructs the engine to use the simple amplitude level of the audio signal for VAD. This is the best setting for transcribing live audio streams (in-call or real-time transcription) because it operates instantaneously, without the need for buffering.
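The practical difference can be sketched with simplified stand-ins for the two measurements (illustrative formulas only; the engine's exact computation is not documented here):

```python
# Illustrative sketch -- not the engine's exact computation.
# "energy" is modeled here as root-mean-square over a buffered frame,
# while "level" is the simple amplitude of each sample, available
# instantaneously with no buffering.
import math

def frame_energy(frame):
    """RMS energy of a frame -- requires the whole frame up front."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def sample_level(sample):
    """Amplitude of a single sample -- no look-ahead needed."""
    return abs(sample)

frame = [100, -200, 300, -400]
print(frame_energy(frame))  # ~273.9: needs all four samples buffered
print(sample_level(-400))   # 400: decided per sample
```

The per-sample computation is what lets the level setting react without buffering delay in live streams.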

uttmaxsilence

Values: integer

Description:

Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.

The default value for uttmaxsilence is 800 milliseconds. This setting will not need to be modified except in unusually aggressive real-time deployments. In most cases, shortening uttmaxsilence to be less than 650 milliseconds will compromise accuracy. This decrease in accuracy worsens as uttmaxsilence is reduced towards its minimum setting of 100 milliseconds.

Note: When lowering the uttmaxsilence value, accuracy is reduced because the shorter threshold for splitting audio regions into utterances results in shorter utterances on average. Shorter utterances mean that less contextual information is available for error reduction.
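The effect can be simulated with a simplified cutting rule (not V-Blaze internals): an utterance is cut once accumulated silence exceeds uttmaxsilence, so lower values yield more, shorter utterances:

```python
# Simplified sketch -- not V-Blaze internals.
# Frames are 100 ms each; True = active, False = silent. An utterance is
# cut once a silent run exceeds uttmaxsilence, so lower values produce
# more, shorter utterances (and less context per utterance).

FRAME_MS = 100

def count_utterances(frames, uttmaxsilence):
    utterances, in_utt, silence_ms = 0, False, 0
    for active in frames:
        if active:
            if not in_utt:
                in_utt = True
                utterances += 1
            silence_ms = 0
        elif in_utt:
            silence_ms += FRAME_MS
            if silence_ms > uttmaxsilence:
                in_utt = False  # cut inside the silent region
    return utterances

# Speech, a 700 ms pause, more speech:
frames = [True] * 5 + [False] * 7 + [True] * 5
print(count_utterances(frames, uttmaxsilence=800))  # 1 utterance
print(count_utterances(frames, uttmaxsilence=600))  # 2 utterances
```

With the default 800 ms the 700 ms pause survives inside one utterance; at 600 ms the same pause splits the speech in two.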

uttmaxtime

Values: integer

Description:

Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.

The default value for uttmaxtime is 150 seconds. Human utterances are typically 5-20 seconds long. The uttmaxtime setting rarely requires modification. Examples of use cases that can benefit from adjusting this parameter include transcribing monologues or speeches with unusually long unbroken utterances, and real-time deployments with aggressive turn-around time requirements.

In most cases, shortening the value of the uttmaxtime tag to less than 20 seconds will compromise accuracy, and the effect worsens as uttmaxtime is reduced towards its minimum setting of 1 second.

Note: When the value of the uttmaxtime tag is reduced, accuracy suffers whenever the ASR engine is forced to terminate an utterance at the uttmaxtime boundary. Such "cuts" can take place while a word is being spoken, leaving a portion of the word in the first utterance and the remainder in the second. With few exceptions, word fragments do not sound like the original word, resulting in erroneous transcription. In addition, shorter utterances contain less context, further reducing achievable accuracy.
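The forced-cut behavior can be sketched as simple arithmetic (illustrative only): an unbroken utterance is split at every uttmaxtime boundary, wherever that boundary happens to fall:

```python
# Illustrative sketch -- not V-Blaze internals.
# An unbroken utterance is force-cut every uttmaxtime seconds, so the
# cut points fall wherever the boundary lands -- possibly mid-word.

def forced_cuts(duration_s, uttmaxtime=150):
    return [t for t in range(uttmaxtime, int(duration_s), uttmaxtime)]

print(forced_cuts(400))      # [150, 300]: two forced cuts, three pieces
print(forced_cuts(400, 20))  # aggressive setting: 19 cuts, far less context
```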