Confidence scores
Voci products use a confidence scoring system to represent the ASR engine's estimate of the probability that it selected the correct words when transcribing speech to text, and the correct classifications for certain audio properties.
For example, when a word is decoded from speech, that word is assigned a numeric confidence score. A confidence score is not an accuracy measure; it indicates how confident the ASR engine is that it selected the word most likely to be correct out of all the words it believes a region of speech could represent. These confidence values can be used to filter output at various thresholds.
Score values and calculation
Confidence scores have a value between 0 and 1 for the text that is produced, and they are available at the word-event, utterance, and top level in JSON output. Confidence is calculated for each word event, then averaged up to the utterance level. The top-level confidence score is an average of utterance-level confidence scores.
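The roll-up described above can be sketched as follows. This is a minimal illustration using hypothetical word-level confidence values, not output from a Voci API:

```python
# Sketch of the confidence roll-up: word event -> utterance -> top level.
# The nested lists of word-event confidences are hypothetical sample data.

def mean(values):
    return sum(values) / len(values)

# Each inner list holds the word-event confidences for one utterance.
utterance_words = [
    [0.86, 0.95, 0.98],  # utterance 1
    [0.70, 0.80],        # utterance 2
]

# Utterance confidence is the average of its word-event confidences.
utterance_conf = [mean(words) for words in utterance_words]

# Top-level confidence is the average of utterance-level confidences.
top_level_conf = mean(utterance_conf)

print([round(c, 2) for c in utterance_conf])  # [0.93, 0.75]
print(round(top_level_conf, 2))               # 0.84
```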
Confidence scores tend to decrease as calls become more noisy, and as the overall speech signal moves further away from the speech used to train the Voci acoustic and language models. Noise, compression artifacts, and accents all contribute to lower confidence scores.
Interpreting confidence scores
A confidence score is only relevant to the language model used to transcribe the audio, and confidence should not be used to measure performance across multiple models. To measure performance across models, perform Word Error Rate accuracy comparisons against reference transcript data for each of the models.
A confidence score on its own is not meaningful unless you have alternative results to compare against, either from the same recognition operation or from previous recognitions of the same input. Confidence values are relative and unique to each recognition engine. Confidence values returned by two different recognition engines cannot be meaningfully compared. A speech recognition engine may assign a low confidence score to spoken input for various reasons, including background interference, inarticulate speech, or unanticipated words or word sequences.
Confidence scores and clarity
Confidence scores are generated by the ASR engine at the utterance level. Clarity values that appear in the V‑Spark UI and in the JSON output's app_data object are generated by V‑Spark during the transcript analysis phase of call processing.
Clarity and confidence are different views of the same data. Both V‑Spark's clarity scores and the ASR engine's confidence scores are numbers between 0 and 1 that indicate the ASR engine's confidence in its transcription, where 1 is highest. Agent and client clarity scores in the app_data object are averages of the transcription's utterance-level confidence scores. In other words, confidence is calculated for each utterance, while clarity is calculated for each speaker using time-weighted, per-speaker confidence averages.
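A time-weighted, per-speaker confidence average can be sketched as below. The utterance fields mirror the JSON transcript structure; weighting by utterance duration is an assumption about how "time-weighted" is computed, not a documented V‑Spark formula:

```python
# Sketch of a time-weighted, per-speaker confidence average (clarity).
# Hypothetical utterances; weighting by duration (end - start) is an
# assumption about V-Spark's "time-weighted" calculation.

utterances = [
    {"channel": 0, "start": 0.41, "end": 20.21, "confidence": 0.86},
    {"channel": 1, "start": 6.80, "end": 8.20, "confidence": 0.90},
    {"channel": 0, "start": 21.00, "end": 25.00, "confidence": 0.95},
]

def clarity(utterances, channel):
    """Duration-weighted average confidence for one speaker channel."""
    selected = [u for u in utterances if u["channel"] == channel]
    total = sum(u["end"] - u["start"] for u in selected)
    weighted = sum((u["end"] - u["start"]) * u["confidence"] for u in selected)
    return weighted / total

print(round(clarity(utterances, 0), 3))  # 0.875
```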
Clarity and confidence scores are in different parts of the JSON depending on the product that generated them.
This example JSON transcript shows confidence at multiple levels. Each higher level is an aggregate value of its lower-level values, down to the utterance level.
    {
      "emotion": "Improving",
      "donedate": "2021-11-12 11:28:41.802605",
      ...,
      "confidence": 0.84,
      "sentiment": "Negative",
      "source": "audio.wav",
      "lidinfo": {
        "lang": "eng",
        "speech": 12.67,
        "conf": 1.0
      },
      ...,
      "nsubs": 2,
      "utterances": [
        {
          "emotion": "Negative",
          ...,
          "musicinfo": {
            "score": -0.8,
            "used": 12.67
          },
          "confidence": 0.86,
          "end": 20.21,
          "sentiment": "Negative",
          "start": 0.41,
          "recvdate": "2021-11-12 11:28:33.118784",
          "events": [
            {
              "confidence": 0.86,
              "end": 0.68,
              "start": 0.41,
              "word": "So"
            },
            {
              "confidence": 0.95,
              "end": 0.95,
              "start": 0.68,
              "word": "thank"
            },
            {
              "confidence": 0.98,
              "end": 1.01,
              "start": 0.95,
              "word": "you"
            },
            ...
          ]
        },
        {
          "emotion": "Neutral",
          ...,
          "musicinfo": {
            "score": 0.24,
            "used": 0.26
          },
          "confidence": 0.0,
          "end": 8.2,
          "start": 6.8,
          ...
        }
      ]
    }
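Word-level confidence values like those above can be used to filter low-confidence words out of downstream processing. A minimal sketch, assuming a transcript dictionary with the same utterances/events shape (the sample data here is hypothetical and truncated):

```python
import json

# Minimal transcript with the same utterances/events shape as the
# example above (hypothetical sample data; other fields omitted).
transcript_json = """
{
  "confidence": 0.84,
  "utterances": [
    {
      "confidence": 0.86,
      "events": [
        {"confidence": 0.86, "start": 0.41, "end": 0.68, "word": "So"},
        {"confidence": 0.55, "start": 0.68, "end": 0.95, "word": "thank"},
        {"confidence": 0.98, "start": 0.95, "end": 1.01, "word": "you"}
      ]
    }
  ]
}
"""

def confident_words(transcript, threshold):
    """Yield words whose event confidence meets the threshold."""
    for utterance in transcript["utterances"]:
        for event in utterance["events"]:
            if event["confidence"] >= threshold:
                yield event["word"]

transcript = json.loads(transcript_json)
print(list(confident_words(transcript, 0.8)))  # ['So', 'you']
```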
This next example shows clarity for an agent and a client in a single transcript.
    {
      "emotion": "Positive",
      ...,
      "app_data": {
        "silence": "0.349",
        "agent_channel": 0,
        "agent_clarity": "0.895",
        "agent_emotion": "Positive",
        "client_emotion": "Improving",
        "overall_emotion": "Improving",
        "datetime": "2022-11-18 18:26:59 UTC",
        "scorecard": {...},
        "client_clarity": "0.898",
        "overtalk": null,
        ...
      }
    }
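Note that the clarity values in app_data are serialized as strings, so consumers must convert them before numeric use. A small sketch using the field names from the example above:

```python
import json

# app_data fragment matching the example above; clarity values
# arrive as strings and must be converted before numeric use.
app_data = json.loads("""
{
  "agent_channel": 0,
  "agent_clarity": "0.895",
  "client_clarity": "0.898"
}
""")

agent_clarity = float(app_data["agent_clarity"])
client_clarity = float(app_data["client_clarity"])

print(agent_clarity, client_clarity)  # 0.895 0.898
```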
Emotion and text analysis
As of V‑Blaze version 7.3 and V‑Cloud version 1.6-2021-10.25, some emotion and text analysis is performed as part of ASR processing. The latest version of V‑Spark does not account for the text analytics values generated by the ASR engine. As a result, transcripts generated by V‑Spark systems using V‑Blaze 7.3 or greater, or using V‑Cloud, may contain redundant data depending on audio properties and transcription parameters. Although these redundant data values describe identical aspects of the transcribed text, they are calculated differently, and they can be distinguished by name or by JSON object hierarchy. These data fields include the following:
Data | Description |
---|---|
Emotion | The ASR engine and V‑Spark use different value sets for emotion calculation results. |
Emotion scores | Emotion scores generated in V‑Spark are directly assigned the agent or client classification; this designation may be configured for each channel, or it may have been detected using side classification. |
Diarization scores | Each value indicates the system's certainty that it correctly diarized the audio. ASR and V‑Spark diarization scores are calculated differently. |
Overtalk | Both the ASR engine and V‑Spark calculate overtalk. |
Silence | Both the ASR engine and V‑Spark calculate silence as any audio segment without speech, which may include music or other noise. Values may vary between the two sources because they use different processes to detect speech. |
Diarization scores
Diarization score values displayed in the V‑Spark UI and JSON transcripts function similarly to clarity and confidence scores, but diarization is a completely different metric. Diarization is the process by which multiple speakers in mono audio are separated onto distinct channels, and a diarization score refers to the ASR engine's certainty that it separated speakers correctly. Diarization and its score values apply only to mono audio. Since clarity and confidence measure the probability that the ASR engine selected the best text match, these values are always applicable when speech is decoded from audio.
Other score values
Aside from individual words, other ASR metrics use score values that are similar to confidence scores in theory but vary in practice. Score values typically refer to a parent element, and they are usually labeled score in JSON output, but sometimes a score value is a weighted average of other scores in the transcription data.
For example, the score value in a musicinfo object at the utterances level indicates the ASR engine's level of confidence that it correctly classified the speech segment described by the utterances object as music. However, if the musicinfo object appears at the top level or in a chaninfo object, the score value is a time-weighted average of the individual utterance scores.
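A time-weighted average of per-utterance musicinfo scores can be sketched as below. Treating the used field as the duration weight is an interpretation on my part, not something the documentation states:

```python
# Sketch: combine per-utterance musicinfo scores into a time-weighted
# average. Using "used" as the duration weight is an assumption, not a
# documented behavior; the values mirror the transcript example above.

utterance_musicinfo = [
    {"score": -0.8, "used": 12.67},
    {"score": 0.24, "used": 0.26},
]

total_used = sum(m["used"] for m in utterance_musicinfo)
top_score = sum(m["score"] * m["used"] for m in utterance_musicinfo) / total_used

print(round(top_score, 3))  # -0.779
```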