Audio properties

Audio data properties and quality have a significant impact on ASR output. The sections below provide an overview of the most important factors.

Single-channel (mono) and channel-separated audio

It is important to distinguish between single-channel (mono) and channel-separated audio.

  • In mono audio, all speakers are recorded on a single channel.

  • In channel-separated audio, each speaker is isolated to a distinct channel.

Channel-separated audio makes it possible to transcribe each channel independently and maintain a perfect correspondence between the person speaking and the words spoken. For analytic purposes, it is important to have each speaker on a separate channel.

For example, channel-separated audio not only decouples overtalk from overall accuracy, it also allows overtalk in calls to be measured objectively. In single-channel (mono) audio, by contrast, the greater the overtalk, the lower the overall accuracy will be.
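With channel-separated audio, overtalk can be quantified directly: given each channel's speech intervals (for example, word timings from per-channel transcripts), overtalk is the total time both channels contain speech. The sketch below is illustrative; the interval lists are hypothetical inputs, not a Voci API.

```python
def overlap_seconds(agent_segments, caller_segments):
    """Total duration (in seconds) during which both channels contain speech."""
    total = 0.0
    for a_start, a_end in agent_segments:
        for c_start, c_end in caller_segments:
            # Overlap of two intervals is max(0, min(ends) - max(starts)).
            total += max(0.0, min(a_end, c_end) - max(a_start, c_start))
    return total

# Example: agent and caller speak in turns, with brief overtalk at the handoffs.
agent = [(0.0, 4.0), (6.0, 9.0)]
caller = [(3.5, 6.5), (8.5, 10.0)]
print(overlap_seconds(agent, caller))  # 1.5 seconds of overtalk
```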

Voci employs a process called diarization to separate the speakers in mono audio onto distinct channels. Diarization is less effective when the source audio includes hold music, voice recordings, or more than two speakers, and overtalk may also reduce its accuracy. However, for typical agent-and-caller calls with only two speakers, diarization is very effective at separating the speakers onto their own channels for enhanced analytics.

Note: Recording channel-separated source audio instead of mono typically yields a 10% accuracy increase. Voci highly recommends using channel-separated audio for transcription.

Types of errors from mono transcripts

The following list describes four types of errors that may occur when transcribing mono audio with diarization applied.

Overtalk Word Error

Overtalk word errors are caused by two people speaking at the same time. This creates an unintelligible audio region that cannot be transcribed reliably. Overtalk negatively impacts accuracy whether or not diarization is used.

Diarization Word Error

Diarization word errors are caused by the diarizer splitting channels within a word instead of between words. When this incorrect splitting occurs, each word fragment is transcribed independently, resulting in error.

Diarization Side Error

Diarization side errors occur when the diarizer makes an incorrect assignment and places speech on the wrong channel.

Side Classification Error

Side classification errors are caused by failures to correctly identify the side of the conversation containing the majority of the contact center agent’s speech.

Note: The errors mentioned above do not apply to channel-separated audio.

Types of errors from stereo transcripts

Transcripts of channel-separated stereo audio contain only a single type of error: incorrect transcription of an audio region. This is referred to as word error.

Sample rate and bit rate

A digital audio segment's sample rate specifies the number of samples captured from each second of the source material; a higher sample rate increases the ability of digital audio to faithfully represent high frequencies. The highest frequency that can be accurately represented is half the sample rate. Since the human voice typically spans the 40 Hz - 4 kHz range, a typical phone call is sampled at 8 kHz, or 8,000 samples per second. This sample rate is sufficient to produce a good transcription.
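The half-the-sample-rate limit (the Nyquist frequency) is simple arithmetic, sketched here:

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can accurately represent."""
    return sample_rate_hz / 2

print(nyquist_hz(8000))   # 4000.0 Hz, which covers the voice band
print(nyquist_hz(16000))  # 8000.0 Hz
```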

Bit rate

Most digital audio processing depends on two parameters, sampling rate and bit depth, which together determine the bit rate (bit rate = sampling rate x bit depth). A typical phone conversation has a bit rate of 8 kHz x 8 bits of depth = 64 kbits per second, which is acceptable for producing a good transcription. The optimal bit rate is 8 kHz x 16 bits of depth = 128 kbps.
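The bit-rate arithmetic above can be sketched as:

```python
def bit_rate_kbps(sample_rate_hz, bit_depth, channels=1):
    """Uncompressed (PCM) bit rate in kbits/s: rate x depth x channels."""
    return sample_rate_hz * bit_depth * channels / 1000

print(bit_rate_kbps(8000, 8))   # 64.0 kbps, a typical phone conversation
print(bit_rate_kbps(8000, 16))  # 128.0 kbps, the optimal bit rate
```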

Bit depth

Bit depth determines the dynamic range of a given audio sample. A higher bit depth represents amplitudes more precisely: if an audio sample contains both loud and soft sounds, more bits are needed to represent them accurately. A typical phone call uses 8-bit depth. For comparison, audio CDs use 16-bit depth, and DVD/HD audio uses 24-bit depth.
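For linear PCM, each additional bit adds roughly 6 dB of dynamic range (20·log10 of the number of amplitude levels). A rough sketch for linear PCM (speech codecs such as µ-law distribute their levels differently):

```python
import math

def dynamic_range_db(bit_depth):
    """Approximate dynamic range of linear PCM: 20*log10(2**bits), ~6 dB per bit."""
    return 20 * math.log10(2 ** bit_depth)

print(round(dynamic_range_db(8)))   # 48 dB (typical phone call)
print(round(dynamic_range_db(16)))  # 96 dB (audio CD)
print(round(dynamic_range_db(24)))  # 144 dB (DVD/HD audio)
```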

Codecs

Audio codecs that are not optimized for speech degrade the speech signal and can reduce transcription accuracy, so it is important to choose codecs wisely. The following list describes a variety of audio codecs and their implications for transcription accuracy.

G.711 µ-law / A-law (best option)

64 kbps per channel - low-compression, speech-optimized codec

G.729A/B (average option)

8 kbps per channel - aggressive compression, but optimized for speech

G.723.1 (poor option)

6 kbps per channel - highly compressed with poor quality

Alternative audio codec options:

Opus

Newer, speech-optimized codec providing good results at 32 kbps per channel and higher

MP3

MP3 is a lossy codec optimized for music, not speech. If MP3 is being used for transcription, Voci recommends using at least 32 kbps per channel.

Note: Transcription accuracy decreases as the quality of audio degrades.
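As a concrete example, a recording can be converted to the preferred G.711 µ-law format with a tool such as ffmpeg. The helper below only builds the command line rather than running it; the flags assume a standard ffmpeg build, and the file names are placeholders.

```python
def g711_transcode_cmd(src, dst):
    """Build an ffmpeg command that converts audio to 8 kHz G.711 mu-law WAV."""
    return [
        "ffmpeg",
        "-i", src,                # input file in any format ffmpeg supports
        "-ar", "8000",            # resample to 8 kHz
        "-acodec", "pcm_mulaw",   # G.711 mu-law encoding
        dst,
    ]

# To actually run it: subprocess.run(g711_transcode_cmd("call.mp3", "call.wav"))
print(" ".join(g711_transcode_cmd("call.mp3", "call.wav")))
```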

Transcoding and compressing audio

Transcoding is the direct digital-to-digital conversion of data from one encoding to another. Transcoding audio files can help conserve disk space and shorten the time it takes to transfer files. The following list describes the four transcode types, each of which has different implications for transcription accuracy.

Lossless-to-lossless (recommended)

No audio information is lost during lossless-to-lossless transcoding. Converting a PCM WAV file to FLAC is an example of lossless transcoding, commonly used to save disk space without compromising quality. A 10-minute mono WAV file at 8-bit/16 kHz is 9.8 MB, whereas the same file after FLAC conversion is 5.6 MB.
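The uncompressed size is straightforward to estimate from the audio parameters (the estimate ignores container overhead, so it will differ slightly from on-disk sizes such as the 9.8 MB figure above):

```python
def wav_size_mb(duration_s, sample_rate_hz, bit_depth, channels=1):
    """Approximate PCM WAV payload size in megabytes (ignores the small header)."""
    return duration_s * sample_rate_hz * (bit_depth / 8) * channels / 1e6

print(wav_size_mb(600, 16000, 8))  # ~9.6 MB for 10 minutes of 8-bit/16 kHz mono
```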

Lossless-to-lossy

Lossless-to-lossy transcoding eliminates information from the audio signal that is less important for human speech comprehension. However, this loss of information negatively impacts ASR performance. Therefore, this form of transcoding is not recommended.

Lossy-to-lossy

Any transcoding to a lossy format decreases quality. Lossy-to-lossy transcoding is even worse because each successive transcoding pass causes a progressive loss of quality. This is known as "digital generation loss" or "destructive transcoding" and is irreversible.

Lossy-to-lossless

Transcoding from lossy to lossless is strongly discouraged: the quality of the audio does not improve, and the file size increases.

Miscellaneous noises

Miscellaneous noises in recorded audio can come from various sources, such as TVs, music, and nearby people. Voci employs noise detection techniques to remove noise from the signal before sending the audio for further processing.

The noise removal process is not perfect: audio sources that share the same frequency range as speech are particularly challenging to remove. In a call center environment, it is not uncommon for nearby agents to be recorded. If non-speech elements remain in the audio at transcription time, overall transcription accuracy will be negatively affected.

The following best practices help minimize recorded noise:

  • Use headsets with high-quality near-field microphones

  • Make the recording workspace as quiet as possible

  • Use high quality telephony and recording equipment