Assisted error identification

Voci provides a Python 2.7 script named findMisDecodings.py to help locate candidates for substitution.

The findMisDecodings.py script examines time-weighted relative confidence scores of Ngrams to identify short phrases and individual words with low average relative confidence across the working set. An Ngram in this context is a phrase of N words, where N can be 1, 2, 3, or 4.

The script identifies words and phrases with lower-than-average confidence scores. Confidence scores are added to the JSON transcript by the ASR engine. Each score ranges from 0 to 1 and represents the engine's certainty that it has transcribed the word correctly. Words with a higher-than-average confidence score are more likely to have been transcribed correctly.
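As a rough illustration of what a time-weighted relative confidence could look like, the sketch below weights each word's confidence by its duration and compares every phrase against the transcript average. The tuple layout, the function names, and the formula itself are assumptions made for illustration; the exact calculation used by findMisDecodings.py is not documented here.

```python
def weighted_confidence(words):
    """Duration-weighted mean confidence of (text, confidence, start, end) tuples."""
    total = sum(end - start for _, _, start, end in words)
    if total <= 0:
        return 0.0
    return sum(conf * (end - start) for _, conf, start, end in words) / total


def ngram_deltas(words, n):
    """Map each n-word phrase to its confidence minus the transcript average.

    A phrase that repeats keeps only its last occurrence in this sketch.
    """
    baseline = weighted_confidence(words)
    deltas = {}
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        text = " ".join(w[0] for w in window)
        deltas[text] = weighted_confidence(window) - baseline
    return deltas


# Hypothetical word events: (text, confidence, start_seconds, end_seconds).
transcript = [
    ("thank", 0.9, 0.0, 0.5),
    ("you", 0.9, 0.5, 1.0),
    ("milan", 0.3, 1.0, 1.5),
    ("lemons", 0.3, 1.5, 2.0),
]

# Print phrases from the most negative relative confidence upward.
for phrase, delta in sorted(ngram_deltas(transcript, 2).items(), key=lambda kv: kv[1]):
    print("%-14s %+.2f" % (phrase, delta))
```

In this made-up transcript the phrase "milan lemons" falls well below the transcript average, which is the kind of phrase the script is designed to surface.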

The results from the findMisDecodings.py script contain substitution candidates that can be verified by listening to the associated portions of audio. Good substitution candidates occur frequently, have a low average confidence, and tend to appear out of context.

A few requirements must be met before you can use findMisDecodings.py to identify substitution candidates. Install Python 2.7 and the Python module NLTK (version 3.2.5 or later) on your system. Then place the JSON transcripts in a working directory.

Run findMisDecodings.py from the command line (Unix/Linux shell or PowerShell), passing the directory that contains the JSON files along with the other required parameters. Running findMisDecodings.py without parameters prints help text that describes proper usage.

How to use findMisDecodings.py

The following example illustrates running the findMisDecodings.py script on a specific directory location with the needed parameters. The command must be entered on a single line.

findMisDecodings.py JSON-directory text-directory report-name minHits maxHits rptLen ch filter agent-channel [exclusion-list]

The following list describes each parameter of findMisDecodings.py. Parameters must be supplied in the order shown; only exclusion-list is optional.

JSON-directory

Directory where JSON transcripts are placed

text-directory

Directory where generated text files are placed

report-name

Name of generated report

minHits, maxHits

Minimum and maximum number of times a phrase must occur to be reported. Use -1 for maxHits to indicate no maximum.

rptLen

Maximum length of each N-gram report section. Filtering by maxHits or exclusions can shorten the report sections further

ch

Channel to analyze: 0, 1, or -1 for all channels. For stereo (2-channel) audio, channel 0 is the left audio channel and channel 1 is the right audio channel.

filter

"assoc" to filter by word association strength and "none" to not filter

agent-channel

Set as 0 or 1 to specify the channel the agent is speaking on

exclusion-list

(Optional) A file of N-grams, one N-gram per line, that you do not want included in the generated reports.
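The parameter order can be easy to get wrong, so a small helper that assembles the argument list may be useful. The following Python sketch is illustrative only: the function name, the directory names, and the rptLen value of 50 are hypothetical, and the check that maxHits is not below minHits is this sketch's own sanity check rather than documented script behavior.

```python
def build_command(json_dir, text_dir, report_name, min_hits, max_hits,
                  rpt_len, channel, filter_mode, agent_channel,
                  exclusion_list=None):
    """Return the findMisDecodings.py argument list in the documented order."""
    if channel not in (0, 1, -1):
        raise ValueError("ch must be 0, 1, or -1")
    if filter_mode not in ("assoc", "none"):
        raise ValueError("filter must be 'assoc' or 'none'")
    if agent_channel not in (0, 1):
        raise ValueError("agent-channel must be 0 or 1")
    if max_hits != -1 and max_hits < min_hits:
        raise ValueError("maxHits must be -1 or at least minHits")

    args = ["findMisDecodings.py", json_dir, text_dir, report_name,
            str(min_hits), str(max_hits), str(rpt_len), str(channel),
            filter_mode, str(agent_channel)]
    # The exclusion list is the only optional parameter and always comes last.
    if exclusion_list is not None:
        args.append(exclusion_list)
    return args


# Hypothetical values: analyze all channels, agent on channel 0, no exclusions.
command = build_command("transcripts/json", "transcripts/text", "report",
                        5000, -1, 50, -1, "assoc", 0)
print(" ".join(command))
```

Depending on how Python 2.7 is installed, the printed command may need to be prefixed with the interpreter name before it is run.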

Voci recommends that new users of findMisDecodings.py start with a narrow range between minHits and maxHits to gain a better understanding of how the script works. Once you understand how the script works, the minHits and maxHits parameters can be used more effectively. Good settings for minHits and maxHits help find more valuable substitution candidates.

When working with larger data sets, a good strategy is to generate seven reports with minHits and maxHits ranges of (5000, -1), (2500, 5000), (1250, 2500), (625, 1250), (312, 625), (156, 312), (78, 156). Each report covers a different range of frequencies, and the value and impact of each report declines as error frequency drops. Start by working through the (5000, -1) report, which only shows phrases found 5000 times or more. For example, if you discover a transcription error that occurs 7000 times, the rule that corrects it will fix 7000 errors. The value of a substitution rule grows with the number of errors it corrects.

After validating everything in the first report, move down to the (2500, 5000) report. All phrases in this report occur between 2500 and 5000 times. The value of correction is still high, but not as high as in the (5000, -1) report. Because the frequency ranges are mutually exclusive, each report contains different phrases. Working in order from highest to lowest frequency makes the best use of available time: the most pervasive transcription errors are corrected before you move on to errors that occur less frequently.
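The recommended ranges follow a simple halving pattern, which the Python sketch below reproduces. The function name and its defaults are hypothetical; the starting threshold of 5000 and the count of seven reports come from the strategy described above and can be adjusted for your data set.

```python
def frequency_bands(top=5000, bands=7):
    """Return (minHits, maxHits) pairs, halving the threshold for each band."""
    result = [(top, -1)]  # -1 means no maximum for the highest band
    upper = top
    for _ in range(bands - 1):
        lower = upper // 2  # integer halving, so 625 becomes 312
        result.append((lower, upper))
        upper = lower
    return result


# Work through the bands from the most frequent phrases to the least frequent.
for min_hits, max_hits in frequency_bands():
    print("minHits=%d maxHits=%d" % (min_hits, max_hits))
```

Each pair can then be supplied as the minHits and maxHits parameters of a separate findMisDecodings.py run, with a different report-name for each run.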

The findMisDecodings.py script generates plain-text versions of the transcripts along with two reports which are assigned extensions ".score" and ".delta".

The Score report is sorted by a score calculated from the frequency of the phrase (Hits) and the magnitude of the confidence displacement (Delta). This sorting scheme places phrases that occur frequently or have confidence scores significantly below average near the top.

The Delta report is sorted by the magnitude of the confidence displacement only. Phrases near the top have the lowest relative average confidence regardless of how often they occur in the working-set transcripts. Therefore, these phrases are the most likely to be erroneous, but some won't occur frequently enough to be of interest. Judgement and validation are still required to determine the usefulness of these substitution candidates.

The following example is a report of substitution candidates provided by findMisDecodings.py.

Substitution Candidates Report
==============================

N = 4, Ngram Count = 4

Delta | Hits | Text                                | Duration | Score
----------------------------------------------------------------------
-0.25 |   1  | shh shh shh shh                     |   0.53   | -0.25
-0.07 |   1  | whalen denzel cunningham jr         |   2.04   | -0.07
-0.04 |   2  | slovenia bosnia herzegovina italy   |   5.00   | -0.08
-0.03 |   1  | denzel cunningham jr whalen         |   2.08   | -0.03

N = 3, Ngram Count = 4

Delta | Hits | Text                                | Duration | Score
----------------------------------------------------------------------
-0.26 |   1  | milan lemons lemons                 |   1.10   | -0.26
-0.05 |   2  | abdel salam amsterdam               |   3.34   | -0.11
 0.02 |   2  | slovenia bosnia herzegovina         |   4.02   |  0.04
 0.04 |   1  | excitement drummond tangent         |   1.56   |  0.04

The lists show N as 4 and 3, where N is equal to the number of words per phrase. The Ngram Count is the number of Ngrams listed in each section, which has been limited to 4 in this example by the rptLen parameter. Additional lists appear further down in the report for N = 2 and then N = 1.

The following definitions explain the column titles in the Substitution Candidates Report.

Delta

The confidence for the Ngram relative to the average confidence of the transcripts containing the Ngram

Hits

The number of times the Ngram occurs in the working set.

Text

The Ngram found.

Duration

The total number of seconds that the phrase occupies across the entire working set.

Score

The Ngram's relevance score, calculated from Hits and Delta.
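The difference between the Score and Delta orderings can be seen by sorting the N = 4 rows of the sample report on each column. The Python sketch below copies the Delta, Hits, Text, and Score values from that sample; it assumes that the most negative value is listed first, which matches the sample but is not stated explicitly for the .score report.

```python
# (delta, hits, text, score) copied from the N = 4 list in the sample report.
rows = [
    (-0.25, 1, "shh shh shh shh", -0.25),
    (-0.07, 1, "whalen denzel cunningham jr", -0.07),
    (-0.04, 2, "slovenia bosnia herzegovina italy", -0.08),
    (-0.03, 1, "denzel cunningham jr whalen", -0.03),
]

# Assumption: both reports list the most negative value first.
by_delta = sorted(rows, key=lambda r: r[0])
by_score = sorted(rows, key=lambda r: r[3])

print("Delta order:")
for delta, hits, text, score in by_delta:
    print("  %+.2f  %s" % (delta, text))

print("Score order:")
for delta, hits, text, score in by_score:
    print("  %+.2f  %s" % (score, text))
```

The phrase with two hits moves ahead of a one-hit phrase when sorted by Score, because Score accounts for frequency as well as Delta.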

Note: The reports provided by the findMisDecodings.py script are meant to provide you with possible substitution candidates. The substitution candidates still need to be located and verified by listening to the associated audio portions.