Real-time transcription

Real-time transcription resembles the standard callback mechanism, with one major difference: instead of POSTing the entire transcript to the callback server, the transcript of each utterance is HTTP POSTed to a client-side callback server as soon as it is ready. An utterance is transcribed when either of two events occurs:

  1. A break in speech

  2. The max utterance length is reached

The max utterance length can be set as high as 80 seconds (15 seconds is typical), but the right value depends on the solution and use case and usually requires tuning. Note that setting the max utterance length too low will most likely degrade transcription accuracy: a lower setting reduces the amount of context that the ASR engine relies on for recognition.
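The interaction between the two events can be illustrated with a small sketch. This is not the ASR engine's algorithm: it operates on hypothetical word timestamps rather than audio, and the names split_utterances, min_pause, and max_utterance are assumptions made for illustration only. An utterance is closed either when the gap before the next word reaches min_pause or when adding the next word would push the utterance past max_utterance.

```python
from typing import List, Tuple

# (text, start_sec, end_sec) for one recognized word; hypothetical input format.
Word = Tuple[str, float, float]


def split_utterances(words: List[Word],
                     min_pause: float = 0.5,
                     max_utterance: float = 15.0) -> List[List[str]]:
    """Group words into utterances using the two events described above."""
    utterances: List[List[str]] = []
    current: List[str] = []
    current_start = 0.0
    prev_end = 0.0
    for text, start, end in words:
        if current:
            # Event 1: a break in speech before this word.
            paused = start - prev_end >= min_pause
            # Event 2: this word would exceed the max utterance length.
            too_long = end - current_start > max_utterance
            if paused or too_long:
                utterances.append(current)
                current = []
        if not current:
            current_start = start
        current.append(text)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances
```

With a low max_utterance, continuous speech is cut into many short utterances, each carrying little surrounding context, which is the situation the accuracy warning above describes.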

Latency is measured from the time an utterance ends to the time its transcription result is posted. Load affects this latency:

  • Light load: 0.2x latency should be expected

  • Medium load: 1x latency should be expected

  • Heavy load: > 1x latency should be expected
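The latency definition above can be written out as a short calculation. The text does not state what the multipliers (0.2x, 1x) are relative to; the sketch below assumes they are multiples of the utterance's own duration, and that assumption should be verified before relying on it. The class and field names are hypothetical, and all timestamps are assumed to come from the same clock.

```python
from dataclasses import dataclass


@dataclass
class UtteranceTiming:
    start: float      # when the utterance began, in seconds
    end: float        # when the utterance ended, in seconds
    posted_at: float  # when the transcription result was posted, in seconds

    @property
    def latency(self) -> float:
        """Seconds from the end of the utterance to the posted result."""
        return self.posted_at - self.end

    @property
    def latency_factor(self) -> float:
        """Latency as a multiple of utterance duration (assumed meaning of 'x')."""
        return self.latency / (self.end - self.start)
```

Under this reading, a 10-second utterance whose result is posted 2 seconds after it ends has a latency factor of 0.2.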

Refer to the Real-time streaming transcription with V‑Blaze section of the V‑Blaze API Guide for more information on how to use real-time transcription.
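For orientation, a client-side callback server that accepts per-utterance POSTs might look like the following minimal sketch, built on Python's standard http.server module. It assumes each POST body is a UTF-8 JSON object; the actual payload format, and the request parameters that enable real-time callbacks, are defined in the V‑Blaze API Guide and are not shown here. The names record_utterance, UtteranceCallbackHandler, and make_server, and port 8080, are illustrative.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Utterance results in arrival order (kept in memory for illustration only).
received_utterances = []
_lock = threading.Lock()


def record_utterance(body: bytes) -> dict:
    """Decode one POST body and store it.

    Assumes a UTF-8 JSON object; check the real payload format
    against the API guide.
    """
    utterance = json.loads(body.decode("utf-8"))
    with _lock:
        received_utterances.append(utterance)
    return utterance


class UtteranceCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            record_utterance(body)
        except ValueError:
            # Covers both invalid UTF-8 and invalid JSON.
            self.send_response(400)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # suppress the default per-request logging


def make_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    return HTTPServer((host, port), UtteranceCallbackHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Because each utterance arrives as a separate request, the receiver should respond quickly and defer any heavy processing, so that a slow handler does not add to the end-to-end latency.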