Real-time streaming transcription with V‑Blaze
The transcription of an utterance proceeds in three phases, described below to illustrate the timing of real-time V‑Blaze transcription (a worked timing example follows the list):
- V‑Blaze receives audio data packets as fast as the sender can provide them. For a live 2-channel telephone call sampled at 8 kHz and encoded as PCM with a 2-byte sample size, each V‑Blaze stream receives 8000 Hz * 2 bytes * 2 channels = 32,000 bytes per second. V‑Blaze buffers this audio until it detects a sufficiently long silence or the maximum utterance duration is exceeded. For example, for a 15-second utterance, V‑Blaze spends 15 seconds buffering audio.
- Once V‑Blaze has buffered a complete utterance, it transcribes the utterance. If V‑Blaze is configured to transcribe at 1x, it can take up to 15 seconds to transcribe a 15-second utterance; configured at 5x, it can take up to 15/5 = 3 seconds.
- As soon as transcription of the utterance completes, the result is POSTed to the utterance callback server.
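As a rough back-of-envelope illustration of this timing (a sketch only; real deployments also incur network and queuing delays not modeled here):
# Illustrative end-to-end latency for one utterance, using the values above
UTT=15    # utterance duration in seconds (the buffering phase)
SPEED=5   # configured transcription speed factor (5x real time)
echo "callback fires ~$((UTT + UTT / SPEED)) seconds after the utterance begins"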
For example, suppose a server (the "sender") is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the "receiver") is configured to receive utterance transcript data on port 5556. Note that sender and receiver can be running on the same machine, and can even be different threads of the same program, or they can be two entirely different, geographically distributed systems. The following request will initiate real-time transcription:
curl -F utterance_callback=http://receiver:5556/utterance \
-F datahdr=WAVE \
-F socket=sender:5555 \
http://vblaze_name:17171/transcribe
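For quick testing, the receiver can be as simple as a netcat loop that acknowledges each POST (the same loop used in the test example later in this section):
# Answer every incoming POST on port 5556 with an HTTP 200 response.
while true ; do /bin/echo -e 'HTTP/1.1 200 OK\r\n' | nc -l 5556; done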
It is often the case that real-time streaming audio will not include a WAV header. When transcribing raw or headerless audio, the datahdr field is not used to define the file header; raw encoded audio is supported by explicitly providing the information normally supplied by the header. This includes, at a minimum, the sample rate, sample width, and encoding. The byte endianness can also be specified; however, the default value of LITTLE is usually correct. The following is an example:
curl -F utterance_callback=http://receiver:5556/utterance \
-F socket=sender:5555 \
-F samprate=8000 \
-F sampwidth=2 \
-F encoding=spcm \
http://vblaze_name:17171/transcribe
Refer to Voice Activity Detection and utterance controls for more information on parameters used with real-time transcription.
Real-time calls
Real-time calls can be emulated by providing a --limit-rate option. Here we assume sample1.wav has a sample rate of 8 kHz and 2 bytes per sample, meaning 16,000 bytes per second corresponds to a real-time transfer.
These examples transcribe a file named sample1.wav using a V‑Blaze REST API instance running on a server named vblaze_name; both parameters can be changed without altering the function of any commands.
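A minimal sketch of such a command (--limit-rate is a standard cURL option; the query parameters shown are illustrative and may need adjustment for your deployment):
# Hypothetical sketch: cap the upload at 16,000 bytes per second so the ASR
# receives sample1.wav as if it were a live call.
curl -s --limit-rate 16000 \
-F 'file=@sample1.wav' \
'http://vblaze_name:17171/transcribe?realtime=true&output=text'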
Real-time redacted audio
When realtime=true, scrubaudio=true, and notext=true are specified, redacted audio is streamed back in real time and can be piped directly to an audio player such as SoX's play. The first command below reads audio from standard input (-T -), so an audio source must be piped into it; the second uploads sample1.wav as a multipart file.
curl -s -XPOST \
-T - \
'vblaze_name:17171/transcribe?realtime=true&uttmaxgap=0&vadtype=level&scrubaudio=true&notext=true' |
play -V0 -q -
curl -s \
-F 'file=@sample1.wav' \
'vblaze_name:17171/transcribe?realtime=true&uttmaxgap=0&vadtype=level&scrubaudio=true&notext=true' |
play -V0 -q -
These examples transcribe a file named sample1.wav using a V‑Blaze REST API instance running on a server named vblaze_name; both parameters can be changed without altering the function of any commands.
Real-time from Standard In
This method requires V‑Blaze version 7.2 or greater.
For a single part POST, this method requires cURL 7.68.0 or greater.
In this example, Pipe Viewer (pv) is used to read and rate-limit the file before passing the audio to cURL via standard input, as in the sketch below.
These examples transcribe a file named sample1.wav using a V‑Blaze REST API instance running on a server named vblaze_name; both parameters can be changed without altering the function of any commands.
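A minimal sketch of this flow under those assumptions (pv -qL 16000 throttles to the real-time rate for 8 kHz, 2-byte samples; realtime=true and output=text are illustrative parameter choices):
# Hypothetical sketch: pv streams sample1.wav at the real-time rate and cURL
# forwards it to the ASR as a single-part POST read from standard input.
pv -qL 16000 sample1.wav | \
curl -s -XPOST -T - \
'http://vblaze_name:17171/transcribe?realtime=true&output=text'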
Real-time transcription test example
This example requires three terminal sessions: one to receive utterances in real time, one to send audio, and one for API requests. Multiple terminal sessions can be managed with terminal multiplexers such as Screen or tmux.
This example uses localhost; however, the callbacks and audio source can be directed to other hosts.
- Receive real-time utterance-by-utterance callbacks
Use the following command to run netcat in a loop to receive POSTs:
while true ; do /bin/echo -e 'HTTP/1.1 200 OK\r\n' | nc -l 5556; done
Note: Some netcat implementations require nc -l -p 5556 instead of nc -l 5556.
- Send an audio stream via a socket with netcat
Use the pv command to simulate a real-time audio source by limiting the audio rate:
pv -L 16000 mono_pcm_sample.wav | nc -l 5555
Note: Double the audio rate to 32000 for stereo PCM.
Or more generally:
cat mono_pcm_sample.wav | pv -L 16000 | nc -l 5555
cat /opt/voci/server/examples/sample7.wav | pv -L 16k | nc -l 5555
- Send a request to the ASR server
The following ASR parameters are a good starting point for common situations, but may require adjustment for specific environments or requirements.
curl -F realtime=true -F output=text \
-F vadtype=level -F activitylevel=175 \
-F uttminactivity=1000 -F uttmaxsilence=500 -F uttpadding=250 -F uttmaxtime=15 -F uttmaxgap=0 \
-F datahdr=WAVE -F socket=localhost:5555 \
-F utterance_callback=http://localhost:5556 \
-X POST http://localhost:17171/transcribe
Note: Both the utterance_callback and socket settings are interpreted from the point of view of the ASR host. The ASR must be able to resolve and connect to the hosts and ports specified.
Audio with a standard WAV (RIFF) header is handled by specifying datahdr=WAVE in the API call. The only time you need to specify nchannels, sampwidth, samprate, encoding, and endian is when the audio is raw or headerless.
RAW or headerless audio
Oftentimes, real-time streaming audio will not include a RIFF header. When transcribing raw or headerless audio, the datahdr field is not used to define the file header; raw encoded audio is supported by explicitly providing the information normally supplied by the header. This includes, at a minimum, the sample rate, sample width, and encoding. The byte endianness can also be specified; however, the default value of LITTLE is usually correct. The following is an example:
curl -F utterance_callback=http://receiver:5556/utterance/index.asp \
-F socket=sender:5555 \
-F samprate=8000 \
-F nchannels=2 \
-F sampwidth=2 \
-F encoding=SPCM \
http://vblaze_name:17171/transcribe
Alternative implementation method: sending real-time audio in the POST
The ASR REST API enables submission of real-time audio directly in the request POST by sending chunks of audio data through the HTTP connection as they become available. Voci can provide the requests_mpstream.py reference code on request if this style of data flow is required for the deployment architecture.
An example of using requests_mpstream.py (as with the socket-based examples, a callback receiver must be listening on port 5556):
pv -qL 16000 sample1.wav | \
python requests_mpstream.py http://localhost:17171/transcribe - realtime=true output=text \
vadtype=level activitylevel=175 \
uttminactivity=1000 uttmaxsilence=500 uttpadding=250 uttmaxtime=15 uttmaxgap=0 \
datahdr=WAVE \
utterance_callback=http://localhost:5556