Measuring accuracy

The accuracy of any automated speech-to-text (STT) solution is measured against a set of reference transcripts, often referred to as the "ground truth," created manually by professional transcribers. Machine-generated transcripts are compared with the manually created reference transcripts to identify substitution, insertion, and deletion errors in the machine transcripts. Word error rate and accuracy metrics are computed from these error counts.
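The comparison described above can be sketched as a word-level edit-distance alignment. This is a minimal illustration, not the scoring tool's actual implementation: the function name `count_errors`, the lowercasing, and the whitespace tokenization are all assumptions made for the example.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) for one transcript pair.

    Assumption: words are compared case-insensitively after splitting on
    whitespace. Real scoring tools apply their own normalization rules.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # cost[i][j] = (total_errors, S, I, D) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # every reference word so far was deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # every machine word so far was inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            # Diagonal move: a match (no cost) or a substitution
            t, s, ins, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, ins, d)
            else:
                best = (t + 1, s + 1, ins, d)
            # Vertical move: a reference word is missing from the machine transcript
            t, s, ins, d = cost[i - 1][j]
            best = min(best, (t + 1, s, ins, d + 1))
            # Horizontal move: the machine transcript has an extra word
            t, s, ins, d = cost[i][j - 1]
            best = min(best, (t + 1, s, ins + 1, d))
            cost[i][j] = best

    _, s, ins, d = cost[len(ref)][len(hyp)]
    return s, ins, d


# "fox" -> "box" is a substitution; "high" is an insertion.
print(count_errors("the quick brown fox jumps",
                   "the quick brown box jumps high"))  # (1, 1, 0)
```

When several alignments have the same total error count, this sketch breaks ties arbitrarily, so the split between error types may differ from what a production scorer reports.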

The accuracy of Voci's STT solutions typically ranges from 75% to 90%, depending on the quality of the source audio. Accuracy is reduced by lossy compression, noise (such as static or background chatter), and distortion from low-quality microphones. Recording source audio in a low-compression encoding, such as G.711 or PCM, improves accuracy. Unlike data in a zip file, which can be restored exactly, audio encoded with a lossy method such as G.729A/B, GSM 6.10, or MP3 permanently loses some of the original audio data.

Voci uses the following measures to determine speech recognition quality:

Word Error Rate (WER) is the strictest measure of recognition quality. Voci calculates the total number of errors (which includes insertions, deletions, and substitutions), divides that by the number of words in the reference transcript, and multiplies the result by 100. The result is the rate of erroneous recognitions. The lower the rate, the better.
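The WER calculation described above can be written as a short function. This is a sketch under the stated definition only; the function and parameter names are hypothetical, and the counts in the example are made up for illustration.

```python
def word_error_rate(substitutions, insertions, deletions, reference_words):
    """WER as a percentage: total errors divided by reference length, times 100."""
    if reference_words <= 0:
        raise ValueError("reference transcript must contain at least one word")
    errors = substitutions + insertions + deletions
    return 100 * errors / reference_words


# Illustrative counts: 6 substitutions, 2 insertions, 4 deletions, 100 reference words
print(word_error_rate(6, 2, 4, 100))  # 12.0
```

Because insertions are counted against the length of the reference transcript, WER can exceed 100 when the machine transcript contains many extra words.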

Accuracy (ACC) is a term used by most speech-to-text companies. The higher the ACC, the better. SCLite, part of the NIST scoring toolkit, calculates accuracy as follows:

ACC = (N - S - I - D) / N × 100

where:

S = number of substitution errors

I = number of insertion errors

D = number of deletion errors

N = number of words in the reference transcript
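Using the four quantities defined above, ACC can be sketched as follows. The example assumes ACC is the complement of WER, that is (N - S - I - D) / N × 100; that formula is an assumption based on the standard definition of word accuracy, and the function name is hypothetical.

```python
def accuracy(s, i, d, n):
    """ACC as a percentage, assuming ACC = (N - S - I - D) / N * 100.

    s, i, d: substitution, insertion, and deletion error counts
    n: number of words in the reference transcript
    """
    if n <= 0:
        raise ValueError("reference transcript must contain at least one word")
    return 100 * (n - s - i - d) / n


# Illustrative counts: S=6, I=2, D=4, N=100
print(accuracy(6, 2, 4, 100))  # 88.0
```

Under this assumption, ACC and WER always sum to 100, so an ACC of 88 corresponds to a WER of 12.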

Percent Words Correct (PWC) is similar to ACC, but it counts only substitution errors. It is often used in text analytics, because small errors tend to cancel each other out across a large number of transcripts. It is also often used as a rough estimate when reviewing manual transcripts, since it is very easy to calculate. The higher the PWC, the better.
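The text does not give a formula for PWC, so the sketch below infers one from the description: only substitution errors are subtracted from the reference word count. Treat the formula (N - S) / N × 100 and the function name as assumptions rather than Voci's definition.

```python
def percent_words_correct(substitutions, reference_words):
    """PWC as a percentage, assuming PWC = (N - S) / N * 100.

    Insertions and deletions are deliberately ignored, following the
    description that PWC counts only substitution errors.
    """
    if reference_words <= 0:
        raise ValueError("reference transcript must contain at least one word")
    return 100 * (reference_words - substitutions) / reference_words


# Illustrative counts: 6 substitutions in a 100-word reference transcript
print(percent_words_correct(6, 100))  # 94.0
```

Since insertions and deletions are left out, PWC computed this way is never lower than ACC for the same transcript pair.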

For more information on improving accuracy, refer to Audio data and accuracy and Factors affecting accuracy.