Voice Activity Detection and utterance controls
The following parameters are most often used in real-time transcription scenarios using V‑Blaze.
Name | Values | Description |
---|---|---|
activitylevel | integer (default: 175) | Specifies the volume threshold for active versus inactive audio. This value should be high enough to screen out noise, but low enough to clearly trigger on speech. Range is 0-32768, correlating to the average magnitude of a signed 16-bit LPCM frame. |
uttmaxgap | integer | Specifies the maximum gap in seconds that can occur between utterances before they are combined. During text processing, each utterance is buffered for a maximum of uttmaxgap seconds. **Tip:** During real-time speech processing, uttmaxgap must be set to 0. Otherwise, utterances may be delayed for modification, which would result in higher utterance latency. Utterance combining applies only during text processing. |
uttmaxsilence | integer (default: 800 ms) | Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds this value, an utterance “cut” is made within the detected silent region. Refer to uttmaxsilence below for more information on this parameter. |
uttmaxtime | integer (default: 150 seconds) | Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching this value, the utterance is terminated forcibly. Refer to uttmaxtime below for more information on this parameter. |
uttminactivity | integer (default: 500 ms) | Specifies how much activity, in milliseconds, is needed (without a silence gap exceeding uttmaxsilence) before audio is treated as speech. |
uttpadding | integer (default: 300 ms) | Specifies how much padding, in milliseconds, around the active area to treat as active. Typically, the higher the uttpadding value, the more audio surrounding detected speech is included in each utterance. |
vadtype | energy (default), level | The two types of Voice Activity Detection (VAD) available during transcription are energy and level. The level VAD classifies audio as active using the fixed activitylevel threshold. |
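As a rough illustration of what the level VAD's activitylevel threshold measures, the sketch below computes the average magnitude of a frame of signed 16-bit LPCM samples and compares it to the default threshold of 175. The function name and sample values are assumptions for illustration only, not V‑Blaze's actual implementation.

```python
# Sketch of a level-style VAD check: a frame of signed 16-bit LPCM
# samples counts as "active" when its average magnitude exceeds the
# activitylevel threshold (range 0-32768, default 175).

def frame_is_active(samples, activitylevel=175):
    """Return True if the frame's average sample magnitude exceeds the threshold."""
    if not samples:
        return False
    avg_magnitude = sum(abs(s) for s in samples) / len(samples)
    return avg_magnitude > activitylevel

# Illustrative frames (real frames would come from the audio stream):
silence = [0, 3, -2, 5, -4, 1]            # low-level noise
speech = [1200, -900, 1500, -1100, 800]   # louder, speech-like activity

print(frame_is_active(silence))   # low average magnitude -> inactive
print(frame_is_active(speech))    # high average magnitude -> active
```

Raising activitylevel makes the detector more aggressive at screening out background noise, at the risk of missing quiet speech; lowering it does the reverse.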
uttmaxsilence
Values: integer
Description:
Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.
The default value for uttmaxsilence is 800 milliseconds. This setting will not need to be modified except in unusually aggressive real-time deployments. In most cases, shortening uttmaxsilence to less than 650 milliseconds will compromise accuracy. This decrease in accuracy worsens as uttmaxsilence is reduced towards its minimum setting of 100 milliseconds.
With a shorter uttmaxsilence value, accuracy is reduced because the shorter threshold for splitting audio regions into utterances results in shorter utterances on average. Shorter utterances mean that less contextual information is available for error reduction.

uttmaxtime
Values: integer
Description:
Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.
The default value for uttmaxtime is 150 seconds. Human utterances are typically 5-20 seconds long, so the uttmaxtime setting rarely requires modification. Examples of use cases that can benefit from adjusting this parameter include transcribing monologues or speeches with unusually long unbroken utterances, and real-time deployments with aggressive turn-around time requirements.
In most cases, shortening the value of the uttmaxtime tag to less than 20 seconds will compromise accuracy, getting worse as uttmaxtime is reduced towards its minimum setting of 1 second.
With a shorter uttmaxtime value, accuracy is reduced whenever the ASR engine is forced to terminate an utterance at the uttmaxtime boundary. Such "cuts" can take place while a word is being spoken, leaving a portion of the word in the first utterance while the remainder is located in the second. With few exceptions, word fragments do not sound like the original word, resulting in erroneous transcription. In addition, shorter utterances contain less context, further reducing achievable accuracy.
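The silence-based and time-based cuts described above can be sketched as a simple segmenter over per-frame activity decisions: an open utterance is closed normally once trailing silence exceeds uttmaxsilence, or closed forcibly once it reaches uttmaxtime. The frame duration, input layout, and function name below are assumptions for illustration; the engine's actual segmentation logic is not shown in this document.

```python
# Sketch of utterance cutting driven by uttmaxsilence and uttmaxtime.
# Input is a sequence of (timestamp_ms, is_active) frame decisions.
# Defaults mirror the documented values: 800 ms of silence, 150 s maximum.

def segment_utterances(frames, frame_ms=20, uttmaxsilence=800, uttmaxtime=150_000):
    utterances = []
    start = None          # start time (ms) of the open utterance, if any
    silence_ms = 0        # trailing silence accumulated inside the open utterance

    for t, active in frames:
        if start is None:
            if active:                      # first active frame opens an utterance
                start, silence_ms = t, 0
            continue
        silence_ms = 0 if active else silence_ms + frame_ms
        end = t + frame_ms
        if silence_ms > uttmaxsilence:
            # Normal cut: enough silence, so end the utterance where silence began.
            utterances.append((start, end - silence_ms))
            start = None
        elif end - start >= uttmaxtime:
            # Forced cut at the uttmaxtime boundary, possibly mid-word.
            utterances.append((start, end))
            start = None
    if start is not None:                   # close any utterance still open
        utterances.append((start, frames[-1][0] + frame_ms))
    return utterances
```

For example, a stream with one second of speech, just over 800 ms of silence, and more speech yields two utterances, while continuous speech with a small uttmaxtime is cut forcibly at regular boundaries.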