Speech intelligence

Table 1. Speech intelligence features
Feature | Real-time / Post-call | Local / V‑Cloud | Languages | Description
Emotion | Both | Both | All | Classifies emotion based on combined acoustic features and word sentiment scores.
Emotional intelligence | Both | Both | English only | Uses a synthesis of acoustic features and word sentiment scores to determine whether a given utterance is Positive, Improving, Neutral, Worsening, or Negative.
Detailed sentiment scoring | Both | Both | English only | Classifies sentiment based on word usage at the call and utterance level. Custom sentiment rules can also be applied.
Confidence scores | Both | Both | All | Scores words, utterances, and calls for the system's confidence in the transcription results.
Speaker turns | Both | Both | All | The number of distinct speaker turns detected in the audio.
Speaker time | Both | Both | All | The total audio time, in seconds, during which words were detected, along with the percentage of total audio time during which words were detected.
Total word counts | Both | Both | All | The total number of words spoken in the transcribed audio file.
Language Identification (LID) | Both | Both | English, Spanish, French | If a LID-supported language is detected, the ASR engine switches to the corresponding model for the detected language.
Gender identification | Both | Both | All | Identifies speakers as male or female.
Agent identification | Both | Both | English only | Identifies which channel is the agent and which is the customer.
Music detection | Both | Both | All | Acoustic-based classification model that identifies when music occurs. Each utterance is scored from -1 to +1, corresponding to the probability that it is music. Music utterances are not transcribed.
Overtalk | Both | Both | All | Overtalk occurs when speakers talk over one another. A recording's overtalk percentage is the count of agent-initiated overtalk turns as a percentage of the total number of agent-speaking turns; that is, out of all of the agent's turns, how many interrupted a client's turn.
Silence | Both | Both | All | An utterance is an uninterrupted chain of spoken language by a single speaker. An utterance ends with a period of silence that exceeds a threshold duration or when the maximum utterance duration threshold is exceeded.
Text information and word counts | Both | Both | All | The total number of words is provided for each call, depending on parameters. Other counts include the number of seconds and average audio time spent on speech, overtalk (including the number of occurrences), and silence; the number of speaker turns; and the number of substitutions, when enabled.
Credit card detection | Both | Both | | Identifies which numbers are likely credit cards (n-16 digits) by adding a tag to the transcript metadata file (even if the number was redacted). Luhn numbers are not redacted when detected, and there is no Luhn-specific redaction functionality.
Speaker turns | Both | Both | All | Adds the number of distinct speaker turns in the audio, for stereo or diarized audio only.

Gender identification

Gender Identification (GID) is a feature that takes an interval of audio as input and outputs either Male or Female. If the submitted interval contains speech from persons of both genders, GID identifies the gender of the person(s) who spoke for the largest portion of that interval.

Refer to Emotion, sentiment, and gender for information on how to use GID.

Music detection

Background music can have a significantly negative impact on transcription accuracy. Enabling music detection can help eliminate this issue by excluding music and other high energy non-speech events from transcription results.

When music detection is enabled, all utterances will be passed through an algorithm to be classified as music or non-music. Utterances classified as non-music will be handled as normal. Utterances classified as music are assumed to contain noisy audio and will not be transcribed.

Important: Utterances classified as music are not processed by any optional transcription features, such as LID, GID, EID, and diarization; those features ignore any utterance classified as music.

Refer to Music for more information on this feature.

Emotion and sentiment analysis

Voci's speech scientists have applied machine learning techniques to the analysis of emotion and sentiment. Emotion information is extracted from the acoustic features of speech, while sentiment is determined by analyzing the text generated from speech.

Computer models, trained using thousands of audio and text samples, are used to determine the emotion or sentiment of each utterance. Additional data indicate the words in the speech utterance that contribute to the computed sentiment. The separate emotion and sentiment values are then combined into a single Emotional Intelligence value that reveals the true voice of customers, so you can get to the heart of their concerns.

Since emotion and sentiment information is captured at the utterance level, Voci can determine how emotion is changing throughout the conversation, and whether the caller is in a more positive state at the end of the call than the beginning.

What are the differences between emotion and sentiment?

Sentiment and emotion are often used interchangeably, but they are not the same. Emotion is a psychological state, such as fear, anger, or happiness, indicated by acoustic features of speech (pitch, speed, tone, and volume) along with the particular vocabulary used by the speaker, which further helps to identify their emotional state.

Sentiment analysis concerns only the specific vocabulary used by the speaker and does not take acoustic features of speech into account. It looks for key words and phrases in a transcript, such as "happy," "upset," "frustrated," "cancel," "hate," "angry," "thank you," "great," or "dislike," to evaluate the speaker's sentiment.

Enabling emotion analysis automatically enables sentiment analysis; however, emotion can also be disabled and sentiment analysis enabled separately. Refer to the following sections for more information.

Requesting emotional intelligence data

The Voci speech translation servers normally do not compute emotion and sentiment from the audio data. To access this feature, you must request it when you submit the audio data. Contact your Voci sales agent to have this feature added to your license.

The method used to request emotion processing depends on the interface you are using to submit your audio data.

V‑Blaze REST API and V‑Cloud

To obtain emotional intelligence data using the V‑Blaze REST API or the V‑Cloud API, add a form field named emotion with the value true to your POST.

Using the curl command-line tool, include the emotion parameter with the value set to true to submit test.wav for translation with emotion results, as shown in the following example:

curl -F "emotion=true" -F "file=@test.wav" -XPOST http://server:17171/STREAM

where server should be replaced by the name or IP address of the translation server.

The command above prints a JSON data structure to your terminal. To save the results to a file instead, append >test.json to the end of your request to redirect the output, as shown in the following example:

curl -F "emotion=true" ... > test.json

These commands apply to all common operating systems, including Windows, Mac, and various forms of Linux. They are intended to be entered in a terminal or command window.
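
If you prefer to script the same request rather than calling curl, the identical form fields can be posted from Python. The following is a minimal sketch using the third-party requests package and the same placeholder server, audio file, and output file names used above; it illustrates the REST request pattern only and is not part of the V‑Blaze Python API described later.

import json
import requests  # third-party HTTP library

# Placeholder server; replace with the name or IP address of the translation server.
url = "http://server:17171/STREAM"

with open("test.wav", "rb") as audio:
    response = requests.post(
        url,
        data={"emotion": "true"},            # request emotional intelligence data
        files={"file": ("test.wav", audio)}  # audio file to transcribe
    )

# Save the returned JSON data structure, equivalent to redirecting curl output to test.json.
with open("test.json", "w") as out:
    json.dump(response.json(), out, indent=4)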

If you know ahead of time that emotional intelligence will always be required, you can request that Voci configure your Web API or cloud interface with emotion enabled by default; send a message to support@vocitec.com to have the default changed. In that case, you do not need to add -F "emotion=true" to your commands.

Refer to Emotion, sentiment, and gender for more details on parameters and how they are used.

V‑Blaze Python API

When using the V‑Blaze WebAPI with Python, one of the first steps in submitting audio for translation is to create a Stream object. To request emotion information, include emotion=True when creating the stream, as shown in the following example:

s = Stream("test.wav", emotion=True)

When the transcript is returned in JSON format, the emotion information will be present.

Receiving emotional intelligence data

The methods described in Requesting emotional intelligence data include emotional intelligence data in the JSON output returned from the transcription server. Each transcribed utterance has its own emotion value, which is one of the following:

  • Positive: Only positive emotions or sentiments were detected.

  • Mostly Positive: Most emotion and sentiment values were positive but some negative values were also present.

  • Neutral: No emotions or sentiments were detected, or the numbers of positive and negative sentiment-bearing phrases were equal.

  • Mostly Negative: Most emotion and sentiment values were negative but some positive values were also present.

  • Negative: Only negative emotions or sentiments were detected.

Consider an example consisting of the following statements: “The matter was resolved in a very professional manner. Your employees are very good.” The JSON data returned by the transcription server begins like this:

{
    "emotion": "Positive",
    "source": "sample1.wav",
    "confidence": 0.8, 5 "utterances": [
        {
        "emotion": "Positive",
        "confidence": 0.8,
        "end": 6.48,
        "sentiment": "Positive",
        "sentimentex": [ 
            [ 3, 0 ], 
            [ [ "+", 1, [ 1, 4 ] ],
            [ "+", 1, [ 6, 9 ] ], 
            [ "+", 1, [ 10, 14 ] ] ]
        ],
        "start": 0,
        "recvdate": "2015-05-20 11:35:46.907486",
        "events": [
            {
            "confidence": 0.63,
            "end": 0.66,
            "start": 0.55,
            "word": "The"
            },
            {
            "confidence": 0.61,
            "end": 1.02,
            "start": 0.66,
            "word": "matter"
            },
  1. The first occurrence of emotion is the overall emotion for the file, "Positive" in this case. This overall rating will be included only if all utterances in the file have the same value. For this simple example, there was only one utterance.

  2. The sentimentex (sentiment extension) entry indicates the sentiment value of phrases in the transcribed text. The entry is a series of lists.

  3. The first list in the sentimentex entry contains the number of positive phrases (3) followed by the number of negative phrases (0).

  4. The other lists in the sentimentex entry indicate the locations of sentiment-bearing phrases in the text. For example, the line [ "+", 1, [ 6, 9 ] ] indicates the following:

    • The “+” indicates that this is a positive phrase. If it had been negative, “-” would appear instead.

    • The 1 indicates the weight of the phrase. This value is not currently used and will always be 1.

    • The 6 and 9 indicate the index of the first word in the phrase and the index of the first word after the phrase. (Note that the first word is counted as index 0.) This means that the phrase consists of 9 − 6 = 3 words.
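
As a practical illustration, the sentimentex indices can be mapped back to the transcribed words using each utterance's events list. The following is a minimal sketch that assumes a transcript saved as test.json with the structure shown above; it prints the emotion and sentiment value of each utterance together with its sentiment-bearing phrases, which also makes it easy to see how emotion changes over the course of a call.

import json

# Load a transcript produced with emotion=true (structure as shown above).
with open("test.json") as f:
    transcript = json.load(f)

for utt in transcript.get("utterances", []):
    # The words of the utterance, in order; word index 0 is the first word.
    words = [event["word"] for event in utt.get("events", [])]
    print("emotion:", utt.get("emotion"), "| sentiment:", utt.get("sentiment"))

    sentimentex = utt.get("sentimentex")
    if not sentimentex:
        continue

    # First list: number of positive phrases followed by number of negative phrases.
    positive_count, negative_count = sentimentex[0]
    print("  positive phrases:", positive_count, "| negative phrases:", negative_count)

    # Remaining lists: [polarity, weight, [first_word_index, first_word_after_phrase]].
    for polarity, weight, (start, end) in sentimentex[1]:
        # weight is always 1 and is not currently used.
        print(" ", polarity, " ".join(words[start:end]))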

Diarization

Diarization is a language-independent process for evaluating a mono audio file. Diarization presumes two people are speaking and separates that mono audio into distinct channels by categorizing speech into two groups. One group is assigned to channel 0 and the other is assigned to channel 1 in the structured transcript.

The system may perform less effectively when source audio includes hold music, voice recordings, or more than two speakers. Overtalk may also reduce the overall accuracy. However, for typical agent and caller situations with only two speakers, diarization is very effective for separating a call into two distinct channels for enhanced analytics.

Using channel-separated audio will eliminate the possibility of channel-assignment errors and is therefore recommended.

Diarization is available at no additional cost. Adjusting for audio describes how to use the diarize parameter for transcription requests.
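
As a minimal sketch of what a diarization request might look like, the example below follows the same form-field pattern used earlier for emotion. The mono_call.wav file name and the server are placeholders, and the diarize=true value is an assumption based on the boolean parameters shown in this document; refer to Adjusting for audio for the authoritative parameter usage.

import json
import requests  # third-party HTTP library

url = "http://server:17171/STREAM"  # placeholder; replace with your translation server

# mono_call.wav is a placeholder mono recording with two speakers.
with open("mono_call.wav", "rb") as audio:
    response = requests.post(
        url,
        data={"diarize": "true"},                 # assumed boolean value; see Adjusting for audio
        files={"file": ("mono_call.wav", audio)}
    )

# In the structured transcript, one speech group is assigned to channel 0 and the other to channel 1.
with open("mono_call.json", "w") as out:
    json.dump(response.json(), out, indent=4)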

Language identification

Language Identification (LID) is a licensed, optional software component that performs spoken language identification. LID is enabled per transcription request by setting the parameter lid=true.

Given an interval of audio, LID outputs the identity of the single language spoken (for example, "French") in that audio. In cases when a submitted interval contains speech in more than one language, LID identifies the language spoken for the largest proportion of that interval.

For transcription requests with lid=true, the default language model, or the model specified in the model stream tag, is assumed. If a LID-supported language is detected, the ASR engine switches to the corresponding model for the detected language, as in the following example:

  1. Transcription requested with LID enabled.

  2. North American English call center (eng-us:callcenter) model is the default language model.

  3. ASR Engine detects Spanish.

  4. Language model is switched to the appropriate model.

Note: LID is only supported for English-Spanish and English-French language pairs.
For more information on how to use LID, refer to Receiving language identification information for V-Blaze or Receiving language identification information for V-Cloud.
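
To make the scenario above concrete, a request with LID enabled might look like the following sketch, which reuses the form-field pattern from the emotion example. The lid parameter and the eng-us:callcenter model name come from this section; passing the model as a form field named model, along with the server details, is an assumption to confirm against the V-Blaze or V-Cloud language identification documentation.

import requests  # third-party HTTP library

url = "http://server:17171/STREAM"  # placeholder; replace with your translation server

with open("test.wav", "rb") as audio:
    response = requests.post(
        url,
        data={
            "lid": "true",                 # enable Language Identification
            "model": "eng-us:callcenter",  # assumed default model tag; switched if Spanish or French is detected
        },
        files={"file": ("test.wav", audio)},
    )

print(response.json())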