Real-time transcription

Real-time streaming transcription enables use cases such as in-call monitoring and alerting a supervisor to intervene in an active call. When real-time mode is activated, a transcript of each utterance is returned as soon as that utterance has been transcribed.

V‑Blaze can be configured to transcribe live streaming audio at a rate between 1X (real time) and 5X (five times faster than real time). In most cases 1X is sufficient. Higher speeds are offered for demanding use cases where milliseconds count. All other factors being equal, delivering 5X real time requires five times the hardware resources of 1X.

Real-time transcription resembles the standard callback mechanism, with one major difference: instead of POSTing the entire transcript to the callback server at once, V‑Blaze HTTP POSTs the transcript of each utterance to the client-side callback server as soon as it is ready. An utterance is closed and transcribed when either of two events occurs:

  1. A break in speech

  2. The max utterance length is reached

The max utterance length setting can be as high as 80 seconds (15 seconds is typical), but the right value depends on the solution and use case and usually requires tuning. Note that setting max utterance length too low will most likely degrade transcription accuracy, because a lower setting reduces the amount of context the ASR engine relies on for recognition.
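The two events above can be illustrated with a small sketch. This is not V‑Blaze's internal algorithm; it only shows how a break in speech or a max utterance length would close an utterance. The function name split_utterances, the min_break_s threshold, and the (start, end) segment format are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of detected speech; hypothetical format


def split_utterances(
    speech: List[Segment],
    max_utterance_s: float = 15.0,  # typical value mentioned in the text
    min_break_s: float = 0.5,       # assumed silence threshold, not a V-Blaze default
) -> List[Segment]:
    """Group speech segments into utterances closed by a break or by max length."""
    utterances: List[Segment] = []
    start = None
    end = None
    for seg_start, seg_end in speech:
        # Event 1: a long enough silence before this segment closes the utterance.
        if start is not None and seg_start - end >= min_break_s:
            utterances.append((start, end))
            start = None
        if start is None:
            start = seg_start
        end = seg_end
        # Event 2: the utterance exceeds the max length, so cut it there.
        while end - start > max_utterance_s:
            cut = start + max_utterance_s
            utterances.append((start, cut))
            start = cut
    if start is not None:
        utterances.append((start, end))
    return utterances
```

With speech detected at 0 to 4 s, 4.2 to 6 s, and 8 to 30 s, the short 0.2 s gap does not split the first utterance, the 2 s gap does, and the long final stretch is cut at 15 s. Lowering max_utterance_s produces more, shorter utterances, which is the situation in which less context is available for recognition.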

Latency is measured from the time an utterance ends to the time its transcription result is POSTed. Load affects this latency:

  • Light load: 0.2x latency should be expected

  • Medium load: 1x latency should be expected

  • Heavy load: > 1x latency should be expected
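The latency definition above can be expressed as a short calculation. The sketch below assumes timestamps in seconds on a common clock. It also assumes, purely for illustration, that the "x" factor is latency divided by the utterance's duration; the text does not define the factor, so treat latency_factor as a hypothetical interpretation.

```python
def utterance_latency(utterance_end_s: float, posted_at_s: float) -> float:
    """Seconds between the end of an utterance and the POST of its transcript."""
    latency = posted_at_s - utterance_end_s
    if latency < 0:
        raise ValueError("result cannot be posted before the utterance ends")
    return latency


def latency_factor(
    utterance_start_s: float, utterance_end_s: float, posted_at_s: float
) -> float:
    """ASSUMPTION: latency as a multiple of the utterance duration."""
    duration = utterance_end_s - utterance_start_s
    if duration <= 0:
        raise ValueError("utterance duration must be positive")
    return utterance_latency(utterance_end_s, posted_at_s) / duration
```

Under that assumption, a 10-second utterance whose transcript is posted 2 seconds after it ends has a factor of 0.2.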

Refer to the Real-time streaming transcription with V‑Blaze section of the V‑Blaze API Guide for more information on how to use real-time transcription.
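As a rough illustration of the client side, the following is a minimal sketch of a callback server that accepts one HTTP POST per utterance, using only the Python standard library. It assumes each POST body is a JSON object; the actual payload format, field names, and port are not specified here and should be taken from the V‑Blaze API Guide. The names parse_utterance, UtteranceHandler, and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_utterance(body: bytes) -> dict:
    """Decode one POSTed body. ASSUMPTION: the body is a JSON object."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class UtteranceHandler(BaseHTTPRequestHandler):
    """Handles one utterance transcript per POST request."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            utterance = parse_utterance(body)
        except ValueError:
            # Malformed JSON, bad encoding, or a non-object payload.
            self.send_response(400)
            self.end_headers()
            return
        # Application logic (for example, alerting a supervisor) would go here.
        print("utterance received:", utterance)
        self.send_response(200)
        self.end_headers()


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Run the callback server until interrupted; host and port are placeholders."""
    HTTPServer((host, port), UtteranceHandler).serve_forever()
```

Calling serve() starts the listener. Because utterances arrive as separate requests, any per-call state (such as the transcript so far) has to be accumulated by the application.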