Confidence scores
Voci products use a confidence scoring system to represent the ASR engine's estimate of the probability that it selected the correct words when transcribing speech to text, and the correct classifications for certain audio properties.
For example, when a word is decoded from speech, that word is assigned a numeric confidence score. A confidence score is not an accuracy measure; it indicates how confident the ASR engine is that it selected the word most likely to be correct out of all the words it believes a region of speech could represent. These confidence values can be used to filter output at various thresholds.
Score values and calculation
Confidence scores have a value between 0 and 1 for the text that is produced, and they are available at the word-event, utterance, and top level in JSON output. Confidence is calculated for each word event, then averaged up to the utterance level. The top-level confidence score is an average of utterance-level confidence scores.
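The roll-up described above can be sketched as follows. This is a minimal illustration using hypothetical word-level confidence values, not output from a Voci API:

```python
# Sketch of the confidence roll-up: word event -> utterance -> top level.
# The nested lists of word-event confidences are hypothetical sample data.

def mean(values):
    return sum(values) / len(values)

# Each inner list holds the word-event confidences for one utterance.
utterance_words = [
    [0.86, 0.95, 0.98],  # utterance 1
    [0.70, 0.80],        # utterance 2
]

# Utterance confidence is the average of its word-event confidences.
utterance_conf = [mean(words) for words in utterance_words]

# Top-level confidence is the average of utterance-level confidences.
top_level_conf = mean(utterance_conf)

print([round(c, 2) for c in utterance_conf])  # [0.93, 0.75]
print(round(top_level_conf, 2))               # 0.84
```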
Confidence scores tend to decrease as calls become more noisy, and as the overall speech signal moves further away from the speech used to train the Voci acoustic and language models. Noise, compression artifacts, and accents all contribute to lower confidence scores.
Interpreting confidence scores
A confidence score is only relevant to the language model used to transcribe the audio, and confidence should not be used to measure performance across multiple models. To measure performance across models, perform Word Error Rate accuracy comparisons against reference transcript data for each of the models.
A confidence score on its own is not meaningful unless you have alternative results to compare against, either from the same recognition operation or from previous recognitions of the same input. Confidence values are relative and unique to each recognition engine. Confidence values returned by two different recognition engines cannot be meaningfully compared. A speech recognition engine may assign a low confidence score to spoken input for various reasons, including background interference, inarticulate speech, or unanticipated words or word sequences.
Confidence scores and clarity
Confidence scores are generated by the ASR engine at the utterance level. Clarity values that appear in the V‑Spark UI and in the JSON output's app_data object are generated by V‑Spark during the transcript analysis phase of call processing.
Clarity and confidence are different views of the same data. Both V‑Spark's clarity scores and the ASR engine's confidence scores are numbers between 0 and 1 that indicate the ASR engine's confidence in its transcription, where 1 is highest. Agent and client clarity scores in the app_data object are averages of the transcription's utterance-level confidence scores. In other words, confidence is calculated for each utterance, while clarity is calculated for each speaker using time-weighted, per-speaker confidence averages.
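A time-weighted, per-speaker confidence average can be sketched as below. The utterance fields mirror the JSON transcript structure; weighting by utterance duration is an assumption about how "time-weighted" is computed, not a documented V‑Spark formula:

```python
# Sketch of a time-weighted, per-speaker confidence average (clarity).
# Hypothetical utterances; weighting by duration (end - start) is an
# assumption about V-Spark's "time-weighted" calculation.

utterances = [
    {"channel": 0, "start": 0.41, "end": 20.21, "confidence": 0.86},
    {"channel": 1, "start": 6.80, "end": 8.20, "confidence": 0.90},
    {"channel": 0, "start": 21.00, "end": 25.00, "confidence": 0.95},
]

def clarity(utterances, channel):
    """Duration-weighted average confidence for one speaker channel."""
    selected = [u for u in utterances if u["channel"] == channel]
    total = sum(u["end"] - u["start"] for u in selected)
    weighted = sum((u["end"] - u["start"]) * u["confidence"] for u in selected)
    return weighted / total

print(round(clarity(utterances, 0), 3))  # 0.875
```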
Clarity and confidence scores are in different parts of the JSON depending on the product that generated them.
This example JSON transcript shows confidence at multiple levels. Each higher level is an aggregate value of its lower-level values, down to the utterance level.
    {
      "emotion": "Improving",
      "donedate": "2021-11-12 11:28:41.802605",
      ...,
      "confidence": 0.84,
      "sentiment": "Negative",
      "source": "audio.wav",
      "lidinfo": {
        "lang": "eng",
        "speech": 12.67,
        "conf": 1.0
      },
      ...,
      "nsubs": 2,
      "utterances": [
        {
          "emotion": "Negative",
          ...,
          "musicinfo": {
            "score": -0.8,
            "used": 12.67
          },
          "confidence": 0.86,
          "end": 20.21,
          "sentiment": "Negative",
          "start": 0.41,
          "recvdate": "2021-11-12 11:28:33.118784",
          "events": [
            {
              "confidence": 0.86,
              "end": 0.68,
              "start": 0.41,
              "word": "So"
            },
            {
              "confidence": 0.95,
              "end": 0.95,
              "start": 0.68,
              "word": "thank"
            },
            {
              "confidence": 0.98,
              "end": 1.01,
              "start": 0.95,
              "word": "you"
            },
            ...
          ]
        },
        {
          "emotion": "Neutral",
          ...,
          "musicinfo": {
            "score": 0.24,
            "used": 0.26
          },
          "confidence": 0.0,
          "end": 8.2,
          "start": 6.8,
          ...
        }
      ]
    }
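Word-level confidence values like those above can be used to filter low-confidence words out of downstream processing. A minimal sketch, assuming a transcript dictionary with the same utterances/events shape (the sample data here is hypothetical and truncated):

```python
import json

# Minimal transcript with the same utterances/events shape as the
# example above (hypothetical sample data; other fields omitted).
transcript_json = """
{
  "confidence": 0.84,
  "utterances": [
    {
      "confidence": 0.86,
      "events": [
        {"confidence": 0.86, "start": 0.41, "end": 0.68, "word": "So"},
        {"confidence": 0.55, "start": 0.68, "end": 0.95, "word": "thank"},
        {"confidence": 0.98, "start": 0.95, "end": 1.01, "word": "you"}
      ]
    }
  ]
}
"""

def confident_words(transcript, threshold):
    """Yield words whose event confidence meets the threshold."""
    for utterance in transcript["utterances"]:
        for event in utterance["events"]:
            if event["confidence"] >= threshold:
                yield event["word"]

transcript = json.loads(transcript_json)
print(list(confident_words(transcript, 0.8)))  # ['So', 'you']
```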
This next example shows clarity for an agent and a client in a single transcript.
    {
      "emotion": "Positive",
      ...,
      "app_data": {
        "silence": "0.349",
        "agent_channel": 0,
        "agent_clarity": "0.895",
        "agent_emotion": "Positive",
        "client_emotion": "Improving",
        "overall_emotion": "Improving",
        "datetime": "2022-11-18 18:26:59 UTC",
        "scorecard": {...},
        "client_clarity": "0.898",
        "overtalk": null,
        ...
      }
    }
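Note that the clarity values in app_data are serialized as strings, so consumers must convert them before numeric use. A small sketch using the field names from the example above:

```python
import json

# app_data fragment matching the example above; clarity values
# arrive as strings and must be converted before numeric use.
app_data = json.loads("""
{
  "agent_channel": 0,
  "agent_clarity": "0.895",
  "client_clarity": "0.898"
}
""")

agent_clarity = float(app_data["agent_clarity"])
client_clarity = float(app_data["client_clarity"])

print(agent_clarity, client_clarity)  # 0.895 0.898
```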
Emotion and text analysis
As of V‑Blaze version 7.3 and V‑Cloud version 1.6-2021-10.25, some emotion and text analysis is performed as part of ASR processing. The latest version of V‑Spark does not account for the text analytics values generated by the ASR engine. As a result, transcripts generated by V‑Spark systems using V‑Blaze 7.3 or greater, or using V‑Cloud, may contain redundant data depending on audio properties and transcription parameters. Although these redundant data values describe identical aspects of the transcribed text, they are calculated differently, and they can be distinguished by name or by JSON object hierarchy. These data fields include the following:
Data | Description |
---|---|
Emotion | The ASR engine and V‑Spark use different value sets for emotion calculation results. |
Emotion scores | Emotion scores generated in V‑Spark are directly assigned the agent or client classification; this designation may be configured for each channel, or it may have been detected using side classification. |
Diarization scores | Each value indicates the system's certainty that it correctly diarized the audio. ASR and V‑Spark diarization scores are calculated differently. |
Overtalk | Both the ASR engine and V‑Spark calculate overtalk. |
Silence | Both the ASR engine and V‑Spark calculate silence as any audio segment without speech, which may include music or other noise. Values may vary between the two sources because they use different processes to detect speech. |
Diarization scores
Diarization score values displayed in the V‑Spark UI and JSON transcripts function similarly to clarity and confidence scores, but diarization is a completely different metric. Diarization is the process by which multiple speakers in mono audio are separated onto distinct channels, and a diarization score refers to the ASR engine's certainty that it separated speakers correctly. Diarization and its score values apply only to mono audio. Since clarity and confidence measure the probability that the ASR engine selected the best text match, these values are always applicable when speech is decoded from audio.
Other score values
Aside from individual words, other ASR metrics use score values that are similar to confidence scores in theory but vary in practice. Score values typically refer to a parent element, and they are usually labeled score in JSON output, but sometimes a score value is a weighted average of other scores in the transcription data.
For example, the score value in a musicinfo object at the utterances level indicates the ASR engine's level of confidence that it correctly classified the speech segment described by the utterances object as music. However, if the musicinfo object appears at the top level or in a chaninfo object, the score value is a time-weighted average of the individual utterance scores.
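A time-weighted average of per-utterance musicinfo scores can be sketched as below. Treating the used field as the duration weight is an interpretation on my part, not something the documentation states:

```python
# Sketch: combine per-utterance musicinfo scores into a time-weighted
# average. Using "used" as the duration weight is an assumption, not a
# documented behavior; the values mirror the transcript example above.

utterance_musicinfo = [
    {"score": -0.8, "used": 12.67},
    {"score": 0.24, "used": 0.26},
]

total_used = sum(m["used"] for m in utterance_musicinfo)
top_score = sum(m["score"] * m["used"] for m in utterance_musicinfo) / total_used

print(round(top_score, 3))  # -0.779
```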