Speech intelligence

Table 1. Speech intelligence features
| Feature | Real-time / Post-call | Local / V‑Cloud | Languages | Description |
|---|---|---|---|---|
| Emotion | Both | Both | All | Classifies emotion based on combined acoustic features and word sentiment scores. |
| Emotional intelligence | Both | Both | English only | Voci's emotion detection feature combines acoustic features and word sentiment scores to determine whether a given utterance is Positive, Improving, Neutral, Worsening, or Negative. |
| Detailed sentiment scoring | Both | Both | English only | Classifies sentiment based on word usage at the call and utterance level. Custom sentiment rules can also be applied. |
| Confidence scores | Both | Both | All | Scores words, utterances, and calls according to the system's confidence in the transcription results. |
| Speaker turns | Both | Both | All | The number of distinct speaker turns detected in the audio. |
| Speaker time | Both | Both | All | The total audio time, in seconds, during which words were detected, along with the percentage of total audio time during which words were detected. |
| Total word counts | Both | Both | All | The total number of words spoken in the transcribed audio file. |
| Language Identification (LID) | Both | Both | English, Spanish, French | If a LID-supported language is detected, the ASR engine switches to the model for the detected language. |
| Gender identification | Both | Both | All | Identifies speakers as male or female. |
| Agent identification | Both | Both | English only | Identifies which channel is the agent and which is the customer. |
| Music detection | Both | Both | All | Acoustic-based classification model that identifies when music occurs. Each utterance is scored from -1 to +1, corresponding to the probability that it is music. Music utterances are not transcribed. |
| Overtalk | Both | Both | All | Overtalk occurs when speakers talk over one another. A recording's overtalk percentage is the number of Agent-initiated overtalk turns as a percentage of the total number of Agent speaking turns. In other words, out of all of the Agent's turns, it measures how many interrupted a Client's turn. |
| Silence | Both | Both | All | An utterance is an uninterrupted chain of spoken language by a single speaker. An utterance ends when a period of silence exceeds a threshold duration or when the utterance reaches the maximum utterance duration threshold. |
| Text information and word counts | Both | Both | All | The total number of words is provided for each call, depending on parameters. Other counts included are: the number of seconds and average audio time spent on speech, overtalk (including number of occurrences), and silence; the number of speaker turns; and the number of substitutions, when enabled. |
| Credit card detection | Both | Both | | Identifies which numbers are likely credit cards (n-16 digits) by adding a tag to the transcript metadata file (even if the number was redacted). Luhn numbers are not redacted when detected, and there is no Luhn-specific redaction functionality. |
| Speaker turns | Both | Both | All | Adds the number of distinct speaker turns in the audio, for stereo or diarized audio only. |
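
The overtalk percentage described in Table 1 can be illustrated with a short sketch. The `Turn` structure, the speaker labels, and the rule that an Agent turn counts as Agent-initiated overtalk when it starts while a Client turn is still in progress are assumptions made for illustration; they are not taken from the product and may differ from how it detects overtalk.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """Hypothetical representation of one speaker turn."""
    speaker: str   # "agent" or "client" (assumed labels)
    start: float   # start time in seconds
    end: float     # end time in seconds


def overtalk_percentage(turns: List[Turn]) -> float:
    """Agent-initiated overtalk turns as a percentage of all Agent turns."""
    agent_turns = [t for t in turns if t.speaker == "agent"]
    if not agent_turns:
        return 0.0
    client_turns = [t for t in turns if t.speaker == "client"]

    interrupting = 0
    for agent in agent_turns:
        # Assumption: the Agent initiated the overtalk if the Agent's turn
        # begins strictly inside a Client turn.
        if any(c.start < agent.start < c.end for c in client_turns):
            interrupting += 1
    return 100.0 * interrupting / len(agent_turns)


turns = [
    Turn("client", 0.0, 5.0),
    Turn("agent", 4.0, 8.0),    # starts before the Client finishes
    Turn("client", 8.5, 12.0),
    Turn("agent", 12.5, 15.0),  # starts after the Client finishes
]
print(overtalk_percentage(turns))  # 50.0
```

Here one of the two Agent turns interrupts a Client turn, so the overtalk percentage is 50.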
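
The credit card detection entry in Table 1 refers to Luhn numbers. For readers unfamiliar with the term, the following is a minimal sketch of the standard Luhn checksum. The function name `passes_luhn` is hypothetical, and the table does not say how or where the product applies this check, so this shows only the general algorithm.

```python
def passes_luhn(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    # Keep only decimal digits so spaces and dashes are ignored.
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if not digits:
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(passes_luhn("4111 1111 1111 1111"))  # True (well-known test number)
print(passes_luhn("4111 1111 1111 1112"))  # False
```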
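
The utterance boundary rule in the Silence entry of Table 1 (an utterance ends on a sufficiently long silence or at a maximum utterance duration) can be sketched as follows. The word tuple layout, the function name `split_utterances`, and the default thresholds of 1.0 and 30.0 seconds are assumptions chosen for illustration, not product defaults, and the sketch assumes all words come from a single speaker.

```python
from typing import List, Tuple

# Hypothetical word record: (text, start_seconds, end_seconds)
Word = Tuple[str, float, float]


def split_utterances(
    words: List[Word],
    silence_threshold: float = 1.0,   # assumed value, in seconds
    max_duration: float = 30.0,       # assumed value, in seconds
) -> List[List[Word]]:
    """Group time-ordered words from one speaker into utterances."""
    utterances: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current:
            gap = word[1] - current[-1][2]       # silence before this word
            duration = word[2] - current[0][1]   # length if the word is added
            if gap > silence_threshold or duration > max_duration:
                utterances.append(current)
                current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances


words = [("hi", 0.0, 0.4), ("there", 0.5, 0.9), ("okay", 3.0, 3.3)]
for utterance in split_utterances(words):
    print([text for text, _, _ in utterance])
# ['hi', 'there']
# ['okay']
```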