Top-level elements

The following table describes the top-level elements included in a JSON transcript.

Refer to V‑Blaze transcription parameters for more information on the stream tags used to generate the elements that appear in these sections.

Table 1. Top-level elements

Element

Availability

Type

Description

agentscore

V‑Blaze version 7.3+

number

Predicts whether the speaker is the agent or client. Expressed as a value between -1 and 1, where a negative value means the speaker is believed to be the client. A positive value corresponds to an agent. A value closer to -1 or 1 indicates the system is more confident in its prediction, where -1 and 1 indicate the most confidence.

Appears at the top level only when processing mono audio that was not diarized and was submitted for transcription with the stream tag agentid=true. Otherwise, agentscore appears under chaninfo.
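Given the sign and magnitude semantics described above, a consumer can turn agentscore into a speaker label plus a confidence value. A minimal Python sketch; the helper name is illustrative, not part of any API:

```python
def classify_speaker(agentscore):
    """Interpret an agentscore value from a JSON transcript.

    Negative values indicate the client, positive values the agent;
    the absolute value expresses how confident the prediction is.
    """
    label = "agent" if agentscore >= 0 else "client"
    return label, abs(agentscore)

# A strongly negative score indicates the client with high confidence:
print(classify_speaker(-0.92))  # ('client', 0.92)
```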

app_data

V‑Spark only

object

A JSON object that stores metadata and application scores generated by V‑Spark.

asr

V‑Blaze version 6.1+

string

Version number of the automatic speech recognition server being used.

audiosecs

V‑Blaze version 6.1+

number

Duration of audio, in seconds, in the stream.

As of V‑Blaze 7.2, this element will not appear in the JSON output if there was a problem processing audio.

chaninfo

V‑Blaze version 7.3+

array

Appears only for stereo or diarized audio. Contains one object for each audio channel. Each channel object may contain the elements in the following list depending on audio attributes and the stream tags specified with the request.

  • emotion

  • textinfo

  • musicinfo — only appears when there is more than one audio channel

  • agentscore

Elements in the chaninfo array's channel objects contain the same information as the top-level elements described in this table.
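Because chaninfo holds one object per channel with the same element names as the top level, per-channel values can be collected with a simple loop. A sketch, using an invented sample transcript for illustration:

```python
def channel_summaries(transcript):
    """Return (emotion, agentscore) for each channel in a chaninfo array.

    Field names follow the table; elements that were not generated for a
    channel are simply absent, so .get() returns None for them.
    """
    return [
        (ch.get("emotion"), ch.get("agentscore"))
        for ch in transcript.get("chaninfo", [])
    ]

# Hypothetical two-channel (stereo or diarized) transcript fragment:
sample = {
    "chaninfo": [
        {"emotion": "Positive", "agentscore": 0.8},
        {"emotion": "Neutral", "agentscore": -0.6},
    ]
}
print(channel_summaries(sample))  # [('Positive', 0.8), ('Neutral', -0.6)]
```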

client_data

V‑Spark only

object

A JSON object that stores user-supplied call metadata associated with the audio file.

confidence

All

number

A measure of how confident the speech recognition system is in its transcription results. Results range from 0 to 1, with 1 being the most confident.

diascore

V‑Blaze version 7.3+

number

Indicates the level of confidence the system has in its classification of agent and client for audio with two speakers on a single channel. Expressed as a range between 0 and 1, where 1 indicates the best speaker separation.

donedate

All

string

Date and time the file transcription was completed by the speech-to-text engine; that is, when the last utterance finished processing.

emotion

V‑Blaze version 7.3+

string

Describes the emotion trend detected in decoded speech. This trend is calculated based on utterance-level emotion throughout the transcript.

Emotional intelligence consists of both acoustic and linguistic information. Each channel can be given the following values:

  • Positive

  • Improving

  • Neutral

  • Negative

  • Worsening

As of V‑Blaze version 7.3, the emotion field is always included at the top level, and the value describing detected emotion is more dynamic.

The emotion detected toward the end of a call is compared to the emotion detected closer to the beginning. The emotion value describes what the speaker's emotion was, or how speaker emotion changed in transcribed audio.

emotion

V‑Blaze version 7.2 and earlier

string

Describes the emotion detected in decoded speech.

Emotional intelligence consists of both acoustic and linguistic information. Events can be given the following values:

  • Positive

  • Mostly Positive

  • Neutral

  • Mostly Negative

  • Negative

The emotion value is included at the top level only when emotion is the same for all utterances. Additional emotion scoring is available in The utterances array.

ended

V‑Blaze version 6.1+

string

Date and time the stream ended. This is most useful for measuring real-time transcription.

As of V‑Blaze 7.2, this element will not appear in the JSON output if there was a problem processing audio.

gender

All

string

The gender identified for the audio.

langinfo

V‑Blaze version 7.1+

object

Breakdown of language information, added when more than one language was detected. The dictionary contains several fields:

  • utts — the number of utterances spoken in the identified language

  • speech — the number of seconds of automatically detected speech that were used to determine the language used in the stream

  • conf — the confidence score of the language identification decision

  • time — the number of seconds that the language was identified across the whole stream

For example:

 "langinfo": {
            "spa": {
                "utts": 1,
                "speech": 17.46,
                "conf": 1.0,
                "time": 21.56
            },
            "eng": {
                "utts": 1,
                "speech": 1.35,
                "conf": 0.81,
                "time": 0.93
            }
        }
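Given a langinfo object like the one above, the dominant language can be read off by comparing the per-language time values. A minimal Python sketch; the function name is illustrative:

```python
def dominant_language(langinfo):
    """Return the language code identified for the most seconds of audio."""
    return max(langinfo, key=lambda code: langinfo[code]["time"])

langinfo = {
    "spa": {"utts": 1, "speech": 17.46, "conf": 1.0, "time": 21.56},
    "eng": {"utts": 1, "speech": 1.35, "conf": 0.81, "time": 0.93},
}
print(dominant_language(langinfo))  # spa
```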

last_modified

V‑Spark version 4.0.2-1+ only

string

The date and time at which an update to the last_modified field was last triggered in the Elasticsearch record associated with a transcript. If the last_modified field is not present or has no date and time value, its return value is false.

The following events trigger an update to the last_modified field:

  • Creating a new transcript.

  • Updating transcript scores by reprocessing an application.

  • Deleting an application or application category associated with the transcript.

  • Unlinking an application from the transcript's folder, if that application has previously been used to score the transcript.

  • Updating transcript metadata using the API.

Note:

The last_modified field was implemented in V‑Spark version 4.0.2-1. As a result, last_modified is not included in static JSON transcripts generated by older versions. Dynamically generated JSON output, such as that downloaded from the Files View, includes a last_modified field even if its audio record is from an older version.

To add the last_modified field to a transcript's Elasticsearch record that was generated in a version older than 4.0.2-1, an action that triggers an update to last_modified must occur.

license

All

string

Identification information for the license used.

lidinfo

V‑Blaze version 5.6+

object

The lidinfo section is a global, top-level dictionary that contains the following fields:

  • lang — the three-letter language code specifying the language that was identified for the stream

  • speech — the number of seconds of automatically detected speech that were used to determine the language used in the stream

  • langfinal — (V‑Blaze 7.1+) added when the language identified by LID is below the confidence threshold and is not the default language

  • conf — the confidence score of the language identification decision

For example:

   "lidinfo": {
                "lang": "spa",
                "speech": 1.35,
                "langfinal": "eng",
                "conf": 0.81
             }
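Per the field descriptions above, langfinal is present only when the identified language fell below the threshold, so resolving the language actually used for transcription is a matter of preferring langfinal when it exists. A sketch under that assumption:

```python
def resolved_language(lidinfo):
    """Return the language ultimately used for the stream.

    Falls back to the identified language (lang) when no below-threshold
    substitution (langfinal) occurred.
    """
    return lidinfo.get("langfinal", lidinfo["lang"])

lidinfo = {"lang": "spa", "speech": 1.35, "langfinal": "eng", "conf": 0.81}
print(resolved_language(lidinfo))  # eng
```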

model

All

string containing model name if one model was specified;

array of model names if multiple models were specified

Language model(s) specified for transcription. For example:

"model": "eng1:callcenter"

As of V‑Blaze 7.2, this element will not appear in the JSON output if there was a problem processing audio.

musicinfo

V‑Blaze version 7.3+

object

Appears only for stereo audio in which music was detected when audio was submitted for transcription with the stream tag music=true or music=info .

nchannels

All

number

Number of channels in the audio file, unless diarization is set to true, in which case a single-channel file is broken up into 2 channels based on speaker separation.

As of V‑Blaze 7.2, this element will not appear in the JSON output if there was a problem processing audio.

nsubs

V‑Blaze version 7.1+

number

The number of substitutions applied. This tag will not appear if no substitutions were applied.

This value does not include numtrans substitutions.

rawemotion

All

string

Acoustic emotion values. Possible values in version 7.1+ include:

  • ANGRY

  • NEUTRAL

  • HAPPY

Acoustic emotion values prior to version 7.1 include:

  • NONANGRY

  • ANGRY

recvdate

All

string

Date and time the audio file was received by the ASR engine and placed in the queue.

recvtz

All

array

An array containing two values:

  • the time zone abbreviation of the time zone in which the ASR engine is running

  • the offset in seconds from UTC for the time on the ASR engine
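Combining recvdate with the recvtz offset yields an unambiguous UTC timestamp. A Python sketch; the timestamp format string and the sample values are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

def recv_utc(recvdate, recvtz):
    """Convert a local recvdate string to a UTC datetime.

    recvtz is assumed to be [abbreviation, offset-in-seconds-from-UTC];
    subtracting the offset converts engine-local time to UTC.
    """
    _abbrev, offset_seconds = recvtz
    local = datetime.strptime(recvdate, "%Y-%m-%d %H:%M:%S")
    return (local - timedelta(seconds=offset_seconds)).replace(tzinfo=timezone.utc)

# 09:30 at UTC-4 (offset -14400 s) is 13:30 UTC:
print(recv_utc("2023-05-01 09:30:00", ["EDT", -14400]))  # 2023-05-01 13:30:00+00:00
```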

requestid

All

string

The unique identifier for the request.

resampleinfo

V‑Blaze version 7.4+

object

Shows the sample rates, in Hz, of the original file and the output file when audio was resampled. For example:
 "resampleinfo": {
    "in": 11025,
    "out": 8000
  }

scrubbed

All

boolean

If true, the audio was scrubbed so that all numbers are redacted. If scrubbing was not performed, this element does not appear in the JSON output.

sentiment

All

string

Linguistic sentiment value:

  • Positive

  • Mostly Positive

  • Neutral

  • Mostly Negative

  • Negative

  • Mixed (contains both Positive and Negative in the file)

sentiment_scores

All

array

A two-element array: index [0] is the count of positive phrases and index [1] is the count of negative phrases in the file.
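Since the array pairs positive and negative phrase counts, a simple difference gives a net sentiment figure. A sketch; the helper name and sample counts are illustrative:

```python
def net_sentiment(sentiment_scores):
    """Positive minus negative phrase counts.

    A positive result means the file contained more positive phrases
    than negative ones.
    """
    positive, negative = sentiment_scores
    return positive - negative

print(net_sentiment([7, 3]))  # 4
```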

source

All

string

The audio file name.

started

V‑Blaze version 6.1+

string

Date and time the stream started. This is most useful for measuring real-time transcription.

streamtags

V‑Blaze version 6.1+

object

The parameters and other values specified by the user with the request. This is useful for debugging and verification. It is also useful for tagging the output with user-level metadata (for example, tags that have meaning to the user for filtering or association). For example:

   "streamtags": {
        "emotion": "xa",
        "lid": true,
        "subst_rules": "<17 chars>",
        "gender": true,
        "rawemotion": "xa",
        "lidutt": true,
        "substinfo": true,
        "lidthreshold": 1.0,
        "subst": true,
        "scrubtext": true,
        "datahdr": "WAVE",
        "nsubs": "true"
    }

substinfo

V‑Blaze version 7.1+

object

Detail for substitutions that is included when substinfo=true .

  • nsubs (V‑Blaze 7.1+) — The number of substitutions applied, including numtrans substitutions.

  • counts (V‑Blaze 7.1+) — array that contains one nested array for each source of substitution data. Nested arrays show the source (string value) and count (number value) of substitutions that were performed on the audio, along with an object containing the transformation patterns used to perform the substitution.

  • numtrans — (V‑Blaze 7.1+) array that contains details on individual numtrans substitutions.

For example, the object below shows 4 total substitutions from 2 sources using 3 patterns:

"substinfo": {
    "counts": [
        [
            "subst_rules",
            2,
            {
                "yeah => yes": 2
            }
        ],
        [
            "numtrans",
            2,
            {
                "four => 4": 1,
                "last 4 or => last four or": 1
            }
        ]
    ],
    "nsubs": 4
},
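As a consistency check, the per-source counts in a substinfo object should sum to nsubs. A Python sketch over the structure shown in the example above:

```python
def count_substitutions(substinfo):
    """Sum per-source substitution counts from a substinfo object.

    Each entry in counts is [source, count, patterns], so the middle
    element of every entry is totaled.
    """
    return sum(count for _source, count, _patterns in substinfo["counts"])

substinfo = {
    "counts": [
        ["subst_rules", 2, {"yeah => yes": 2}],
        ["numtrans", 2, {"four => 4": 1, "last 4 or => last four or": 1}],
    ],
    "nsubs": 4,
}
print(count_substitutions(substinfo))  # 4, matching nsubs
```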

textinfo

V‑Blaze version 7.3+

object

Text metrics for the audio transcript, including the amount of transcribed audio that was silence or contained words, overtalk metrics, and the total number of words spoken. If the initial audio was stereo or diarized mono, textinfo also includes the total number of speaker turns.

utterances

All

array

Each audio file is broken up into segments of speech called utterances. The utterances array contains the word transcripts and corresponding metadata organized by utterances.

warning

V‑Blaze version 5.6.0-3+

string

This field describes a problem or issue that was encountered during transcription. A common example is substitution errors.