Real-time streaming transcription with V‑Blaze

The transcription of an utterance proceeds in three phases, described below to illustrate the timing of real-time V‑Blaze transcription:

  1. V‑Blaze receives audio data packets as fast as the sender can provide them. For a live 2-channel telephone call sampled at 8 kHz and encoded as PCM with a 2-byte sample size, each V‑Blaze stream will receive (8000 samples/second * 2 bytes * 2 channels) = 32,000 bytes per second. V‑Blaze buffers this audio data until it detects a sufficiently long silence or until the maximum utterance duration has been exceeded. For example, for an utterance of duration 15 seconds, V‑Blaze will spend 15 seconds buffering audio.

  2. Once V‑Blaze has buffered a complete utterance, it transcribes the utterance. If V‑Blaze has been configured to transcribe at 1x, it can take up to 15 seconds to complete the transcription of a 15-second utterance. If it has been configured to transcribe at 5x, it can take up to 15/5 = 3 seconds (see the latency sketch after this list).

  3. As soon as transcription of the utterance completes, the resulting transcript is POSTed to the utterance callback server.
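
To make the timing concrete, the following is a minimal sketch of the worst-case end-to-end latency for a single utterance, using the example values above (assumptions: a 15-second utterance, a 5x speed factor, and negligible callback POST time):

# Worst-case latency sketch for one utterance (assumptions: 15-second
# utterance, 5x speed factor, negligible callback POST time)
UTT_SECONDS=15
SPEED_FACTOR=5
BUFFER_S=$UTT_SECONDS                      # phase 1: buffering runs in real time
DECODE_S=$((UTT_SECONDS / SPEED_FACTOR))   # phase 2: 15 / 5 = 3 seconds
echo "Worst case: $((BUFFER_S + DECODE_S)) seconds from utterance start to callback"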

For example, suppose a server (the "sender") is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the "receiver") is configured to receive utterance transcript data on port 5556. Note that the sender and receiver can run on the same machine (even as different threads of the same program) or on two entirely separate, geographically distributed systems. The following request will initiate real-time transcription:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F datahdr=WAVE \
     -F socket=sender:5555 \
     http://vblaze_name:17171/transcribe
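
As a concrete sender, the following is a minimal sketch that serves a WAV file on port 5555 at the real-time byte rate (assumptions: pv and nc are installed, and sample.wav is a hypothetical 8 kHz, 16-bit, 2-channel recording):

# Serve the file at 32,000 bytes per second, the real-time rate for
# 8 kHz stereo PCM; V-Blaze connects to this socket and pulls the stream
pv -L 32000 sample.wav | nc -l 5555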

Real-time streaming audio often will not include a WAV header. When transcribing raw or headerless audio, the datahdr field is not used to define the file header; instead, raw encoded audio is supported by explicitly providing the information the header would normally supply. This includes, at a minimum, the sample rate, sample width, and encoding. Byte endianness can also be specified, but the default value of LITTLE is usually correct. The following is an example:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F socket=sender:5555 \
     -F samprate=8000 \
     -F sampwidth=2 \
     -F encoding=spcm \
     http://vblaze_name:17171/transcribe
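
If matching headerless test audio is needed, one option (a sketch, assuming SoX is installed and in.wav is a hypothetical source recording) is to convert with sox and stream the result at the 8 kHz mono real-time rate:

# Convert to raw 8 kHz, 16-bit, signed little-endian mono PCM, intended to
# match the samprate, sampwidth, and encoding values above, then serve it
sox in.wav -t raw -r 8000 -b 16 -e signed-integer -c 1 out.raw
pv -L 16000 out.raw | nc -l 5555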

Refer to Voice Activity Detection and utterance controls for more information on parameters used with real-time transcription.

Real-time calls

Important: This method requires V‑Blaze version 7.2 or greater.

Real-time calls can be emulated by providing curl's --limit-rate option. Here we assume sample1.wav has a sample rate of 8 kHz and 2 bytes per sample, meaning 16000 bytes per second corresponds to a real-time transfer.
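
A minimal sketch of such a rate-limited request (assumptions: curl's --limit-rate option, and the realtime=true and output=text parameters shown elsewhere in this section):

# Cap the upload at 16000 bytes per second (8 kHz x 2 bytes per sample, mono)
# so the file is delivered no faster than real time
curl -s -XPOST --limit-rate 16000 -T sample1.wav \
     'http://vblaze_name:17171/transcribe?realtime=true&output=text'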

Real-time redacted audio

Important: This method requires V‑Blaze version 7.2 or greater.

When realtime=true, scrubaudio=true, and notext=true are specified, redacted audio is streamed back in real time.

The following examples transcribe a file named sample1.wav using a V‑Blaze REST API instance running on a server named vblaze_name. Both names can be changed without altering the function of the commands.

curl -s -XPOST \
     -T - \
     'vblaze_name:17171/transcribe?realtime=true&uttmaxgap=0&vadtype=level&scrubaudio=true&notext=true' \
     < sample1.wav |
play -V0 -q -

curl -s \
     -F 'file=@sample1.wav' \
     'vblaze_name:17171/transcribe?realtime=true&uttmaxgap=0&vadtype=level&scrubaudio=true&notext=true' |
play -V0 -q -

In both commands, play is the SoX playback utility; it plays the redacted audio as it streams back.

Real-time transcription test example

This example requires three terminal sessions: one to receive utterances in real time, one to send audio, and one for API requests. Multiple terminal sessions can be managed with a terminal multiplexer such as Screen or tmux.
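
As one option (a sketch; tmux is assumed to be installed, and any three shells work equally well):

# Create a detached tmux session, split it into three panes, and attach
tmux new-session -d -s vblaze-demo
tmux split-window -t vblaze-demo
tmux split-window -t vblaze-demo
tmux attach -t vblaze-demo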

This example uses localhost; however, the callbacks and audio source can be directed to other hosts.

Receive real-time utterance-by-utterance callbacks

Use the following command to run netcat in a loop to receive POSTs:

while true ; do /bin/echo -e 'HTTP/1.1 200 OK\r\n' | nc -l 5556; done
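
To keep a record of what arrives, a variant of the same loop (a sketch; utterances.log is a hypothetical file name) pipes each received POST through tee:

# Acknowledge each POST and append a copy of the request to utterances.log
while true ; do /bin/echo -e 'HTTP/1.1 200 OK\r\n' | nc -l 5556 | tee -a utterances.log; done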

Send an audio stream via a socket with netcat

Use the pv command to simulate a real-time audio source by limiting the rate at which the audio is sent.

pv -L 16000 mono_pcm_sample.wav | nc -l 5555

Note: Double the rate to 32000 bytes per second for stereo PCM.

Or more generally:

cat mono_pcm_sample.wav | pv -L 16000 | nc -l 5555
cat /opt/voci/server/examples/sample7.wav | pv -L 16k | nc -l 5555

Send a request to the ASR server

The following ASR parameters are a good starting point for common situations but may require adjustment for specific environments or requirements.

curl -F realtime=true -F output=text \
     -F vadtype=level -F activitylevel=175 \
     -F uttminactivity=1000 -F uttmaxsilence=500 \
     -F uttpadding=250 -F uttmaxtime=15 -F uttmaxgap=0 \
     -F datahdr=WAVE -F socket=localhost:5555 \
     -F utterance_callback=http://localhost:5556 \
     -X POST http://localhost:17171/transcribe

Note: Both the utterance_callback and socket settings are interpreted from the point of view of the ASR host. The ASR must be able to resolve and connect to the hosts and ports specified.

Tip: Specifying header tags such as samprate, encoding, and endian that do not match the information in a WAV file's header will cause errors during transcription, and the audio file will not be processed. If the audio file is a native WAV file, there is no need to specify anything other than datahdr=WAVE in the API call. The only time you need to specify nchannels, sampwidth, samprate, encoding, and endian is when the audio is raw or headerless.

Raw or headerless audio

As noted above, real-time streaming audio often lacks a RIFF (WAV) header. When transcribing raw or headerless audio, the datahdr field is not used; instead, explicitly provide the information the header would normally supply: at a minimum, the sample rate, sample width, and encoding, plus nchannels for multichannel audio. Byte endianness can also be specified, but the default value of LITTLE is usually correct. The following is an example:

curl -F utterance_callback=http://receiver:5556/utterance/index.asp \
     -F socket=sender:5555 \
     -F samprate=8000 \
     -F nchannels=2 \
     -F sampwidth=2 \
     -F encoding=SPCM \
     http://vblaze_name:17171/transcribe

Alternative implementation method: sending real-time audio in the POST

The ASR REST API also enables submitting real-time audio directly in the request POST: chunks of audio data are sent through the HTTP connection as they become available. Voci can provide the requests_mpstream.py reference code on request if this style of data flow is required for the deployment architecture.

An example of using requests_mpstream.py:

pv -qL 16000 sample1.wav | \
    python requests_mpstream.py http://localhost:17171/transcribe - realtime=true output=text \
        vadtype=level activitylevel=175 \
        uttminactivity=1000 uttmaxsilence=500 uttpadding=250 uttmaxtime=15 uttmaxgap=0 \
        datahdr=WAVE \
        utterance_callback=http://localhost:5556
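
For comparison, a roughly equivalent flow using curl alone (a sketch; curl streams stdin with chunked transfer encoding when given -T -, and the parameters mirror the request above):

# Feed the file at the real-time rate and stream it through the POST body
pv -qL 16000 sample1.wav | \
    curl -s -XPOST -T - \
         'http://localhost:17171/transcribe?realtime=true&output=text&vadtype=level&activitylevel=175&uttminactivity=1000&uttmaxsilence=500&uttpadding=250&uttmaxtime=15&uttmaxgap=0&datahdr=WAVE&utterance_callback=http://localhost:5556'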