Vocabulary development

There are two components to the data included with OOV requests: the vocabulary, which consists of the words and phrases that make up the OOV terms; and the dictionary, which maps non-standard terms to sounds (referred to as sound-outs). Sound-outs are optional. If sound-outs are not supplied with the request, the ASR engine will use its standard interpretations of words, and in many cases this is sufficient.

When defining the vocabulary, phrases should be added as a single element, even if some of the words in a particular phrase are common and already in the model's standard vocabulary. Two- to three-word phrases work best. Single-word vocabulary elements work well if the pronunciation of the word stands apart from the pronunciation of the context in which it appears.
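As a rough illustration of the guidance above, the following Python sketch keeps each phrase as a single vocabulary element and reports which elements fall outside the two- to three-word range. The term "zyntrix", the list layout, and the helper names are all made up for this example and are not the actual request format.

```python
# Hypothetical vocabulary: each element is a whole phrase, never split
# into separate words, even when some of its words are common.
vocabulary = [
    "zyntrix",               # single made-up word
    "zyntrix premium",       # two-word phrase
    "zyntrix premium plan",  # three-word phrase
]


def word_counts(elements):
    """Map each vocabulary element to the number of words it contains."""
    return {element: len(element.split()) for element in elements}


def outside_preferred_length(elements, low=2, high=3):
    """Return elements whose word count falls outside the low..high range."""
    return [e for e in elements if not low <= len(e.split()) <= high]


print(word_counts(vocabulary))
print(outside_preferred_length(vocabulary))
```

An element flagged by outside_preferred_length is not necessarily wrong. A single word can still work well when its pronunciation stands apart from its context, so the result is best treated as a list of elements to review.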

Standard vocabulary gaps

OOV anticipates transcription errors by prioritizing uncommon sound combinations that match the predefined vocabulary submitted with the request. To optimize OOV performance, add terms and phrases that someone unfamiliar with the brand or industry would not expect to encounter in a conversation about it.

For example, to add context for the made-up brand name NewBrandName in a telephone conversation, you might add phrases such as "you have reached NewBrandName" or "thank you for calling NewBrandName and company". Good candidates are phrases that contain the OOV word, occur frequently, and otherwise use relatively common vocabulary.

The model's standard vocabulary is available from Voci support. For self-hosted V‑Blaze, it is located in /opt/voci/models/{acoustic model}/{language model}/words.
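One way to find vocabulary gaps is to compare candidate phrases against the standard word list. The Python sketch below assumes the words file holds one entry per line, which has not been verified here, so the parsing may need adjusting for the real file. The helper names are hypothetical, and a small temporary file stands in for the real word list.

```python
import tempfile
from pathlib import Path


def load_standard_words(path):
    """Read a word list, ASSUMING one entry per line (unverified format)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def words_missing_from(standard_words, phrases):
    """Return words used in the phrases that are absent from standard_words,
    in first-seen order and without duplicates."""
    missing = []
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in standard_words and word not in missing:
                missing.append(word)
    return missing


# Toy stand-in for the real words file, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    words_file = Path(tmp) / "words"
    words_file.write_text("you\nhave\nreached\nthank\nfor\ncalling\n", encoding="utf-8")
    standard = load_standard_words(words_file)

# Only the made-up brand name is missing from the toy list.
print(words_missing_from(standard, ["you have reached NewBrandName"]))
```

Comparison is case-insensitive in this sketch. Whether the real word list is case-sensitive is a further assumption to check.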

Defining sound-outs

Sound-outs are optional. They are most useful when using OOV to look for made-up words, or when a word's pronunciation is not what its spelling would suggest.

Sound-outs may be included in JSON either as a list or as a single value. If sound-outs are not included in the request's OOV configuration, the ASR engine determines pronunciation based on its internal rules and existing vocabulary.
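The two JSON forms can be sketched in Python as follows. The mapping of a term directly to its sound-out is an assumed layout chosen for illustration. The actual key names and nesting in the request's OOV configuration are not shown here, and the helper as_list is hypothetical.

```python
import json


def as_list(sound_outs):
    """Normalize a sound-out given as a single string or a list into a list."""
    if isinstance(sound_outs, str):
        return [sound_outs]
    return list(sound_outs)


# Hypothetical layout: term -> sound-out(s). Illustrative only.
single_value = {"Voci": "woe chee"}             # one sound-out as a single value
list_value = {"Voci": ["woe chee", "vo chee"]}  # several sound-outs as a list

print(json.dumps(single_value))
print(json.dumps(list_value))

# Normalizing both forms makes them easier to handle uniformly before sending.
print({term: as_list(value) for term, value in single_value.items()})
```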

When crafting sound-outs, it's best to use words that exist in the language's standard vocabulary and that have a single, unambiguous pronunciation. Monosyllabic filler words like "uhh" and less predictable sounds like "va" can be interpreted in different ways by the engine, which reduces accuracy.

For example, consider defining the word Voci in an OOV dictionary. The sound-out "vo chee" seems like a good starting point, but the ASR engine doesn't know precisely how to pronounce "vo" because it's not a standard vocabulary word. In this case, the sound-out "woe chee" performs better because the standard model has a precise pronunciation for the word "woe".

Using non-standard words in sound-outs means the engine must calculate those words' pronunciations, which can introduce inaccuracies. With standard words, that calculation has already been done.
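A simple pre-flight check along these lines is to flag any word in a sound-out that is not in the standard vocabulary. The Python sketch below uses a small toy set in place of the real word list. Which words that set contains is assumed purely for illustration, and the function name is hypothetical.

```python
# Toy stand-in for the standard vocabulary. Membership is assumed for
# illustration; a real check would use the model's actual word list.
STANDARD_WORDS = {"woe", "chee"}


def nonstandard_words(sound_out, standard_words=STANDARD_WORDS):
    """Return the words in a sound-out that are not in standard_words."""
    return [w for w in sound_out.lower().split() if w not in standard_words]


for candidate in ("vo chee", "woe chee"):
    print(candidate, "->", nonstandard_words(candidate))
# With the toy set above this prints:
#   vo chee -> ['vo']
#   woe chee -> []
```

A sound-out that comes back with an empty list uses only words the engine already knows how to pronounce, which is the situation the guidance above recommends.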