Accuracy
Transcription accuracy is not a single measurement but the outcome of several factors, which are summarized below. Voci speech scientists apply best practices to improve accuracy against many of these factors.
Domain: Voicemail, Survey, Call Center, Healthcare, etc.
Applications with simpler dialogue (for example, single-caller voicemail) yield higher accuracy than multi-party conversations that may include overtalk. Voci builds language models specifically for these applications to maximize accuracy and speed. Learn more: Language models
Audio quality: compression, codecs, stereo vs. mono.
The noisier and more compressed the audio, the lower the accuracy. Typical telephone audio is encoded with G.711 at a rate of 64 kbps, and Voci takes this format as a baseline. A lower encoding rate will result in lower accuracy. Recording source audio in dual channel rather than mono typically results in higher accuracy, with a difference of as much as 10%. Voci always recommends dual channel. Learn more: Single-channel (mono) and channel-separated audio
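The 64 kbps baseline follows from G.711's 8,000 samples per second at 8 bits per sample. The sketch below shows that arithmetic and a quick way to check whether a recording is dual channel. The function names `bitrate_kbps` and `wav_summary` are illustrative and not part of any Voci tool, and Python's standard `wave` module is assumed to be reading an uncompressed PCM WAV file.

```python
import wave


def bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of a fixed-sample-size stream, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def wav_summary(path):
    """Report channel count and bit rate for a PCM WAV file (illustrative helper)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
    return {
        "channels": channels,
        "sample_rate_hz": rate,
        "bits_per_sample": bits,
        "kbps": bitrate_kbps(rate, bits, channels),
        "dual_channel": channels == 2,
    }


# G.711 telephone audio: 8,000 samples/s x 8 bits, one channel.
print(bitrate_kbps(8000, 8))  # 64.0
```

Calling `wav_summary("call.wav")` on a recording (the file name is a placeholder) returns a dictionary whose `dual_channel` entry indicates whether the two speakers were captured on separate channels.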
Field Tuning: Substitutions
Substitution is an automatic speech recognition (ASR) feature that can automatically correct errors in transcripts. Transcription accuracy in V‑Blaze deployments can be improved using substitution rules to find and replace transcription errors with the corrected text.
Learn more: Substitutions
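As a rough illustration of the idea behind substitution rules, the sketch below applies find-and-replace patterns to a transcript after recognition. It is not V‑Blaze rule syntax: the rule list, the misrecognized phrases, and the `apply_substitutions` function are hypothetical and only show the general technique.

```python
import re

# Hypothetical rules: (pattern for a recurring error, corrected text).
# These are illustrative only and do not reflect V-Blaze's rule format.
RULES = [
    (r"\bvo chee\b", "Voci"),
    (r"\bvee blaze\b", "V-Blaze"),
]


def apply_substitutions(transcript, rules=RULES):
    """Replace each known misrecognition with its corrected text."""
    for pattern, replacement in rules:
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript


print(apply_substitutions("thank you for calling vo chee support"))
# thank you for calling Voci support
```

Word-boundary anchors (`\b`) keep a rule from firing inside a longer word, which matters once the rule list grows.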
Field Tuning: OOV
OOV (out-of-vocabulary) is an ASR tuning feature designed to improve transcription accuracy for audio that contains brand- and industry-specific terminology. OOV enhances existing language models with new words and preferential treatment for those words.
Learn more: Out-of-vocabulary (OOV)
Measuring accuracy
The accuracy of any automated speech-to-text (STT) solution is measured against a set of reference transcripts, often referred to as the "ground truth," created manually by professional transcribers. Machine-generated transcripts are compared with the manually created reference transcripts to identify substitution, insertion, and deletion errors in the machine transcripts. Word error rate and accuracy metrics are computed from these error counts.
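The comparison described above is usually done with a word-level edit-distance alignment. The following is a minimal sketch of that technique, assuming whitespace tokenization and case-insensitive matching; it is not Voci's or SCLite's scoring implementation, and `count_errors` is an illustrative name.

```python
def count_errors(reference, hypothesis):
    """Align two transcripts word by word and count S, D, and I errors."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # cost[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to label each edit.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1  # reference word missing from the machine transcript
            i -= 1
        else:
            ins += 1  # extra word in the machine transcript
            j -= 1
    return {"S": subs, "D": dels, "I": ins, "N": len(ref)}


print(count_errors(
    "i would like to check my account balance",
    "i would like check my count balance please",
))
# {'S': 1, 'D': 1, 'I': 1, 'N': 8}
```

In the example, "to" is deleted, "account" is substituted with "count", and "please" is inserted, giving three errors against eight reference words.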
The accuracy of Voci's STT solutions typically ranges from 75% to 90% depending on the quality of the source audio. Accuracy is reduced by lossy compression algorithms, noise (such as static or chatter), and distortions resulting from low-quality microphones. Unlike data in a zip file, audio encoded with a lossy method such as G.729A/B, GSM 6.10, or MP3 permanently loses some of the original audio data. Recording source audio in a low-compression or uncompressed encoding, such as G.711 or PCM, therefore increases accuracy.
Voci uses the following measures to determine speech recognition quality:
Word Error Rate (WER) is the strictest measure of recognition quality. Voci calculates the total number of errors (which includes insertions, deletions, and substitutions), divides that by the number of words in the reference transcript, and multiplies the result by 100. The result is the rate of erroneous recognitions. The lower the rate, the better.
Accuracy (ACC) is the term used by most speech-to-text companies. The higher the ACC, the better. SCLite, part of the NIST scoring toolkit, calculates accuracy as follows:
ACC = ((N - S - I - D) / N) x 100
where:
S = number of substitution errors
I = number of insertion errors
D = number of deletion errors
N = number of words in the reference transcript
Percent Words Correct (PWC) is similar to ACC, but it only counts substitution errors. This number is often used in the context of text analytics due to the tendency for small errors to cancel each other out when examining a large number of transcripts. This number is also often used as a rough estimate when looking at manual transcripts, since it is very easy to calculate. The higher the PWC, the better.
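The three measures can be computed directly from the error counts. The sketch below assumes ACC is the complement of WER, that is (N - S - I - D) / N x 100, and follows the description above in counting only substitutions for PWC; other toolkits may define percent correct differently. The function names and the sample counts are illustrative.

```python
def _check(n):
    if n <= 0:
        raise ValueError("reference transcript must contain at least one word")


def wer(s, i, d, n):
    """Word Error Rate: all errors divided by reference words, times 100."""
    _check(n)
    return 100.0 * (s + i + d) / n


def acc(s, i, d, n):
    """Accuracy, assumed here to be 100 minus WER."""
    _check(n)
    return 100.0 * (n - s - i - d) / n


def pwc(s, n):
    """Percent Words Correct, counting only substitutions as described above."""
    _check(n)
    return 100.0 * (n - s) / n


# Made-up counts for a 100-word reference transcript.
S, I, D, N = 6, 2, 4, 100
print(wer(S, I, D, N))  # 12.0
print(acc(S, I, D, N))  # 88.0
print(pwc(S, N))        # 94.0
```

Because insertions count as errors, WER can exceed 100 and ACC can fall below zero when the machine transcript contains many extra words.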
Refer to Audio data and accuracy for more information on improving accuracy.
Additional accuracy improvements
For additional accuracy improvements, Voci recommends tuning methods such as substitutions or custom language models. Substitution is an ASR feature that can automatically correct recurring transcription errors. Refer to Substitutions for more information on creating substitutions.
Custom language models add new words to the dictionary and capture statistical properties of speech that are specific to customer use cases. Custom language models can be combined with substitutions to deliver even higher levels of accuracy.
Contact support@vocitec.com for more information on custom language models.