The utterances array

The top-level utterances array is included in a JSON transcript whenever any text is decoded from the audio. It is an array of objects, one object per utterance.

An utterance is defined in this context as an uninterrupted chain of spoken language from a single speaker. An utterance ends when a period of silence exceeds a threshold duration or when the utterance reaches the maximum utterance duration. Each object in the utterances array may contain the elements listed in Table 1 below.
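
A minimal sketch of the overall shape, with illustrative values only (real transcripts contain additional elements, described in the table):

"utterances": [
    {
        "start": 1.81,
        "end": 2.74,
        "confidence": 0.72,
        "events": [ ... ],
        "metadata": { ... }
    }
]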

Table 1. Elements in utterances array objects

Each entry below gives the element name, its JSON type, and the versions in which it is available ("All" means all versions), followed by a description.

emotion (string; All)

Emotional intelligence for the utterance, derived from both acoustic and linguistic information. The element can have one of the following values (see the example after this list):

  • Positive

  • Mostly Positive

  • Neutral

  • Mostly Negative

  • Negative
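
For example (illustrative value):

"emotion": "Mostly Positive"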

confidence (number; All)

A measure of how confident the speech recognition system is in its transcription of the utterance (see the example after this list):

  • Range between 0 and 1

  • 1 is most confident
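
For example (illustrative value):

"confidence": 0.72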

end (number; All)

End time of the utterance, in seconds.

recvtz (array; All)

An array containing two values (see the example after this list):

  • the abbreviation of the time zone in which the ASR engine is running

  • the offset, in seconds, from UTC for the time on the ASR engine
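
For example, an ASR engine running in US Eastern Standard Time (UTC-5) might report the following illustrative values:

"recvtz": ["EST", -18000]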

sentiment (string; All)

Utterance-level linguistic sentiment value:

  • Positive

  • Mostly Positive

  • Neutral (contains no Positive or Negative in the utterance)

  • Mostly Negative

  • Negative

  • Mixed (contains both Positive and Negative in the utterance)

Sentiment values are derived from the ratio of positive to negative classifications as determined by sentimentex.
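
For example (illustrative value):

"sentiment": "Mixed"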

sentimentex (array; All)

Contains detailed sentiment information for each utterance. See Receiving emotional intelligence data for an example and explanation of sentimentex data.

gender (string; All)

Predicted gender of the speaker.
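
For example (illustrative value; the exact label strings depend on the engine configuration):

"gender": "female"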

rawemotion (string; All)

Acoustic emotion values (version 7.1+):

  • ANGRY

  • NEUTRAL

  • HAPPY

Acoustic emotion values (prior to version 7.1):

  • NONANGRY

  • ANGRY
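
For example, with version 7.1+ (illustrative value):

"rawemotion": "NEUTRAL"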

lidinfo (object; V‑Blaze 7.1+)

The lidinfo section is a global, top-level dictionary that contains the following fields:

  • lang — the three-letter language code specifying the language that was identified for the stream

  • speech — the number of seconds of automatically detected speech that were used to determine the language used in the stream

  • conf — the confidence score of the language identification decision

  • langfinal — added when the language identified by LID scores below the confidence threshold and is not the default language

For example:

"lidinfo": {
    "lang": "spa",
    "speech": 17.46,
    "conf": 1.0
}

start (number; All)

Start time of the utterance, in seconds.

donedate (string; All)

Date and time the utterance transcription was completed by the speech-to-text engine.

recvdate (string; All)

Date and time the utterance was received by the speech-to-text engine.
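
For example (illustrative values; the exact timestamp format may vary by version and configuration):

"recvdate": "2021-03-10 20:16:21.81",
"donedate": "2021-03-10 20:16:44.19"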

events (array; All)

Contains information about individual words. Each element is a word object that contains the following values:

  • confidence — number indicating the ASR engine's word-level transcription confidence level, expressed as a value between 0 and 1 where 1 is the most confident.

  • end — number indicating the end time of the word in seconds.

  • start — number indicating the start time of the word in seconds.

  • word — string indicating the normalized word.

  • wordex — string indicating the raw dictionary word. This value may not be present in every object in the events array; it is often used to disambiguate different pronunciations that have the same spelling.

For example:

"events": [
                {
                    "confidence": 0.69,
                    "end": 2.32,
                    "start": 1.81,
                    "word": "Stephanie"
                },
                {
                    "confidence": 0.76,
                    "end": 2.74,
                    "start": 2.32,
                    "word": "so"
                }

Objects in the events array may contain additional key-value pairs depending on the parameters specified with the transcription request.

metadata (object; All)

Speaker and source information for the utterance. Each metadata object contains the following values:

  • channel — number indicating the audio channel on which the utterance was recorded.

  • model — string indicating the model that decoded the utterance.

  • source — string indicating the audio file name.

  • nsubs (V‑Blaze 7.1+) — a number indicating the count of substitutions applied for the utterance, not including numtrans counts.

  • uttid — number indicating the utterance segment.

  • substinfo (V‑Blaze 7.1+) — object with detail about substitutions, included in an utterance object when audio is processed with the stream tag substinfo = true and substitutions were performed on the utterance. Includes the following data:

    • subs (V‑Blaze 7.1+) — array that contains one nested array for each substitution performed. Nested arrays contain number elements describing the start and end time in seconds for the substituted speech, and an additional array of objects with string and number values describing the source and substitution performed.

    • nsubs (V‑Blaze 7.1+) — number indicating the count of substitutions applied to the utterance, including numtrans counts.

For example:

"metadata": {
    "uttid": 3,
    "substinfo": {
        "subs": [
            [
                55.38,
                55.74,
                [
                    {
                        "source": "subst_rules",
                        "end": 55.74,
                        "sub": "yeah => yes",
                        "rule": "0",
                        "start": 55.38
                    }
                ]
            ]
        ],
        "nsubs": 1
    },

music (boolean; V‑Blaze 7.3+)

Appears only if the audio was processed using the stream tag music=true. Has a value of true if music was detected in the utterance.
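
For example (illustrative):

"music": true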

musicinfo (object; V‑Blaze 7.3+)

Appears only if the audio was processed using the stream tag music=info. Contains the following number values (see the example after this list):

  • used — the total audio time, in seconds, during which the utterance contains music.

  • score — a value in the range -1 to 1. A negative value means the utterance is not music; a positive value means it is. Values closer to -1 or 1 indicate greater confidence in the classification, with -1 and 1 indicating the most confidence.
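
For example (illustrative values):

"musicinfo": {
    "used": 4.2,
    "score": 0.87
}

Here the positive score near 1 indicates a confident classification of the utterance as music.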