All Voci products use the JSON file format to store transcript data derived from source audio. This data includes the text decoded from speech, along with metadata that describes audio attributes and the results of linguistic and emotional analysis performed by Voci products.
The data fields included in Voci JSON transcription output vary depending on the product, the processing conditions, and the optional features used to generate the output. There are two categories of Voci JSON data:
- Core ASR data: generated by the ASR engine under most circumstances and whenever text is decoded from speech. For example, top-level fields such as `model` are always included in JSON output because they describe ASR engine and language model attributes. Similarly, the `utterances` array is included whenever audio is successfully transcribed, because it is generated any time speech is decoded from audio.
- Conditional and parameter data: generated only under certain conditions or when certain transcription parameters are specified. For example, the top-level `chaninfo` object is included only for stereo or diarized audio, and the top-level `emotion` field is included only when the transcription request includes the stream tag `emotion = true`.
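Because some fields are always present and others appear only under certain conditions, code that consumes the output should read the conditional fields defensively. The following is a minimal Python sketch, assuming the transcript has already been parsed into a dictionary and that the top-level keys are named exactly as above (`model`, `utterances`, `chaninfo`, `emotion`). The helper name `describe_fields` and the sample data are hypothetical, and the sample values are placeholders rather than real Voci output.

```python
import json
from typing import Any

# Core fields: expected whenever speech is decoded (per the description above).
CORE_FIELDS = ("model", "utterances")
# Conditional fields: present only for certain audio types or stream tags.
CONDITIONAL_FIELDS = ("chaninfo", "emotion")


def describe_fields(transcript: dict[str, Any]) -> dict[str, list[str]]:
    """Report which core fields are missing and which conditional fields are present.

    A missing core field may mean no speech was decoded; a missing conditional
    field simply means the condition or stream tag did not apply.
    """
    return {
        "missing_core": [f for f in CORE_FIELDS if f not in transcript],
        "conditional_present": [f for f in CONDITIONAL_FIELDS if f in transcript],
    }


# Hypothetical, heavily simplified transcript; real output contains many more fields.
sample = json.loads('{"model": "placeholder", "utterances": [], "emotion": "placeholder"}')
print(describe_fields(sample))
```

Treating absence of a conditional field as normal, rather than as an error, matches the distinction drawn above between the two categories.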
Under some conditions, fields with identical names appear at different levels of the JSON data hierarchy. For example, the `agentscore` field is included at the top level only when processing undiarized mono audio with the stream tag `agentid = true`. When processing stereo audio with `agentid = true`, the `agentscore` field instead appears in the `chaninfo` objects for each channel of audio.
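Since `agentscore` can appear at either level, a consumer may need to look in both places. A minimal sketch, assuming `chaninfo` holds a list with one object per channel (a single object is also tolerated, since its exact shape is an assumption here). The function name `agent_scores` and the sample values are hypothetical.

```python
from typing import Any, Optional


def agent_scores(transcript: dict[str, Any]) -> list[Optional[Any]]:
    """Collect agentscore values from wherever they appear.

    Returns a one-element list when the score is at the top level
    (undiarized mono audio), or one entry per channel when it is reported
    inside chaninfo. Channels without an agentscore yield None.
    """
    if "agentscore" in transcript:
        return [transcript["agentscore"]]
    chaninfo = transcript.get("chaninfo", [])
    if isinstance(chaninfo, dict):  # assumption: tolerate a single object as well as a list
        chaninfo = [chaninfo]
    return [channel.get("agentscore") for channel in chaninfo]


mono = {"agentscore": 0.8}  # hypothetical values
stereo = {"chaninfo": [{"agentscore": 0.1}, {"agentscore": 0.9}]}
print(agent_scores(mono))    # [0.8]
print(agent_scores(stereo))  # [0.1, 0.9]
```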
Check the `warning` tag in the JSON output to see whether there were any issues with the transcription.
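A minimal sketch of checking the `warning` tag, assuming it is a top-level key named `warning`. Its value format is not described here, so the sketch returns it unchanged. The function names `load_transcript` and `get_warning` and the inline sample are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_transcript(path: str) -> dict[str, Any]:
    """Parse a Voci JSON transcript file into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def get_warning(transcript: dict[str, Any]) -> Optional[Any]:
    """Return the top-level warning value, or None if no warning was reported."""
    return transcript.get("warning")


# Example with an inline, hypothetical transcript instead of a real file.
example = json.loads('{"utterances": [], "warning": "placeholder warning text"}')
print(get_warning(example))  # placeholder warning text
```

In practice, `get_warning(load_transcript(...))` would be called on each output file before its other fields are trusted.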