ASR capabilities and features

Voci's automatic speech recognition (ASR) engine powers Voci's accurate and scalable Speech-to-Text (STT) solutions. Whether your call volume is measured in hundreds or millions of hours per month, the Voci ASR engine enables you to automatically generate high-quality transcripts from 100% of your speech audio assets.

How does Voci's ASR engine work?

Voci uses deep neural networks and deep belief networks in a proprietary configuration to convert speech to intelligent data. Voci speech recognition uses a combination of assisted and unassisted machine learning and is based on Large Vocabulary Continuous Speech Recognition (LVCSR) technology. Like a phonetic system, LVCSR recognizes phonemes, but it then applies a dictionary or language model to produce a full transcript. The accuracy is much higher than that of the single-word lookup used in a phonetic approach, and the resulting transcript is much easier and faster for contact centers to search and use.

The ASR engine uses language models tuned for telephony-based communications such as customer service call center interactions, voicemail, phone sales, and similar audio. The system is designed for continuous, spontaneous, uncooperative speech. Speech of this type typically occurs during a phone call between an agent and a caller, or in a voicemail, where callers usually leave unrehearsed messages.

Spontaneous, uncooperative speech differs from other kinds of telephony speech, such as a receptionist who is practiced in leaving messages (rehearsed speech), someone reading from a script (read speech), or someone interacting with an interactive voice response (IVR) system (prompted speech).

Table 1. ASR Engine Capabilities






Transcribes digitized audio to text.


Adds punctuation and capitalization. Fully punctuated transcripts significantly improve speech analysis by increasing the understanding of the caller’s intended meaning.

Text Information and Word Counts

The total number of words is provided for each call, depending on the parameters used. Other counts include:

  • Number of seconds and average audio time spent on speech, overtalk (including number of occurrences), and silence.

  • Number of distinct speaker turns in the audio, for stereo or diarized audio only.

  • Number of substitutions, when enabled.

Word Substitutions

Rules-based approach to replacing commonly misidentified words in the transcription. The rules can be generated through Voci's AutoSubs process, which automatically identifies words that are typically transcribed incorrectly.
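
As an illustration only, a rules-based substitution pass can be sketched in a few lines of Python. This is not Voci's implementation: the rule table, its example entries, and the function name `apply_substitutions` are all hypothetical, and the returned count simply mirrors the idea of a "number of substitutions" figure.

```python
import re

# Hypothetical rule table: misrecognized phrase -> intended phrase.
SUBSTITUTION_RULES = {
    "pay pal": "PayPal",
    "e mail": "email",
}

def apply_substitutions(text, rules=SUBSTITUTION_RULES):
    """Replace whole-word matches of each rule key; return (new_text, count)."""
    total = 0
    for wrong, right in rules.items():
        # \b keeps the rule from firing inside longer words.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash handling in `right`.
        text, n = pattern.subn(lambda m, r=right: r, text)
        total += n
    return text, total
```

Calling `apply_substitutions("Send it by e mail")` under these assumed rules returns the corrected text together with the number of replacements made.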


Classifies emotion based on combined acoustic features and word sentiment scores.

Values include strongly positive, positive, neutral, negative, and strongly negative. Scoring is available at the call and individual utterance level. Raw emotion scoring is also available.

Detailed Sentiment Scoring

Classifies sentiment based on word usage at the call and utterance level.

Values include negative, mostly negative, neutral, mostly positive, and positive.

Confidence Scores

Scores words, utterances, and calls for the system's confidence in the transcription results.

Language Identification (LID)

If a LID-supported language is detected, the ASR engine switches to the model for the detected language.

Text Redaction

Redacts numbers from a transcript. Automated numeric redaction reduces PCI/PII risk by automatically finding and eliminating credit card and other sensitive numbers from audio and text.
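
To make the idea of numeric text redaction concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the product's redaction logic: the rule that a sensitive number is a run of four or more digits (optionally separated by single spaces or hyphens), the mask string, and the function name `redact_numbers` are all hypothetical.

```python
import re

# Assumption: a "sensitive number" is 4+ digits, optionally separated
# by single spaces or hyphens (e.g. "4111 1111 1111 1111").
DIGIT_RUN = re.compile(r"\d(?:[ -]?\d){3,}")

def redact_numbers(text, mask="[REDACTED]"):
    """Replace every matching digit run in `text` with `mask`."""
    return DIGIT_RUN.sub(mask, text)
```

A short number such as the "5" in "call me at 5 pm" is left alone under this rule. Numbers spoken and transcribed as words would need a different matching step, which this sketch does not attempt.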

Audio Redaction

Replaces sensitive segments of an audio file with silence. Automated redaction reduces PCI/PII risk by automatically finding and eliminating credit card and other sensitive numbers from audio and text.
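
Replacing a segment with silence amounts to zeroing the audio samples that fall inside the segment's time range. The following Python sketch shows that step under simple assumptions: audio is a flat list of PCM sample values, segments are given as hypothetical `(start_sec, end_sec)` pairs, and the function name `silence_segments` is illustrative rather than part of any Voci API.

```python
def silence_segments(samples, sample_rate, segments):
    """Return a copy of `samples` with each (start_sec, end_sec) span zeroed."""
    out = list(samples)
    for start_sec, end_sec in segments:
        # Convert times to sample indices and clamp them to the audio length.
        start = max(0, int(start_sec * sample_rate))
        end = min(len(out), int(end_sec * sample_rate))
        for i in range(start, end):
            out[i] = 0
    return out
```

The original list is left unmodified, so the unredacted audio can still be handled separately if a workflow requires it.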

Gender ID

Identifies speakers as male or female.

Speaker Separation

Automatically separates customer and agent voices when both are recorded on one channel, enabling their utterances to be analyzed independently. This is referred to as diarization.

Music Detection

Acoustic-based classification model that identifies when music occurs. Each utterance is scored from -1 to +1, corresponding to the probability that it is music. Music utterances are not transcribed.

Agent Identification

Identifies which channel is the agent versus the customer.

Luhn Detection

Identifies which numbers are likely credit cards (n-16 digits) by adding a tag to the transcript metadata file (even if the number was redacted). Luhn numbers are not redacted when detected, and there is no "scrub only Luhn numbers" functionality.
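
The Luhn check itself is a public checksum algorithm: starting from the rightmost digit, every second digit is doubled (subtracting 9 when the result exceeds 9), and the number passes if the total is divisible by 10. A minimal Python sketch follows. The function name `luhn_valid` is illustrative, and how the ASR engine applies the check or constrains the digit count is not shown here.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep only ASCII digits so spaces and hyphens are ignored.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not digits:
        return False
    checksum = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0
```

For example, the widely used test number `4111 1111 1111 1111` passes this check, while changing its last digit to 2 makes it fail.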


Overtalk occurs when speakers talk over one another. A recording's overtalk percentage is the count of Agent-initiated overtalk turns as a percentage of the total number of Agent-speaking turns. In other words, out of all of the Agent’s turns, it measures how many turns interrupted a Client’s turn.
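
The overtalk percentage defined above is a simple ratio, sketched here in Python. The input representation is an assumption made for illustration: one boolean per Agent turn, set to True when that turn interrupted a Client turn. The function name `overtalk_percentage` is hypothetical.

```python
def overtalk_percentage(agent_turns):
    """Percentage of Agent turns that were Agent-initiated overtalk.

    `agent_turns` holds one boolean per Agent turn: True if the turn
    interrupted a Client turn, False otherwise.
    """
    if not agent_turns:
        return 0.0  # No Agent turns, so nothing to measure.
    return 100.0 * sum(agent_turns) / len(agent_turns)
```

With four Agent turns of which two interrupted the Client, `overtalk_percentage([True, False, False, True])` gives 50.0.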

* V‑Cloud implementations do not currently support real-time transcription.

Learn more about the features offered with Voci's ASR engine: