Transcription

A transcript displays the spoken word in a video or audio file as text. Medallia Video provides a time-aligned transcript for every video and audio file in a supported language.

Video supports these types of transcription:

Machine speech-to-text — May also be referred to as automatic speech recognition, computer speech recognition, machine transcription or machine speech recognition.
Human transcription
Speaker separation

For more information, see Supported languages.

Machine speech-to-text

Machine speech-to-text automatically identifies words and phrases in spoken language and renders them as text.

Sound waves from a video or audio file are formatted and processed using a recurrent neural network (a computer system modeled on the human brain and nervous system) to predict and transcribe one letter at a time. Recurrent neural networks have memory that can help predict what the next letter will be. The initial process returns several predicted words. This output is polished to produce the most likely transcription based on the language and its common uses.
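The final "polishing" step can be pictured as choosing among several candidate transcriptions by combining how well each one matches the audio with how common its words are in the language. The following is a minimal sketch under that assumption, not a description of how Video implements it; the function name, the scores, and the word frequencies are all hypothetical.

```python
import math


def pick_transcription(candidates, word_frequency, lm_weight=1.0):
    """Return the candidate text with the best combined score.

    candidates: list of (text, acoustic_log_prob) pairs (hypothetical values).
    word_frequency: mapping of word -> relative frequency in the language.
    """
    def language_score(text):
        # Words missing from the table get a small floor frequency.
        return sum(math.log(word_frequency.get(word, 1e-6)) for word in text.split())

    best = max(candidates, key=lambda c: c[1] + lm_weight * language_score(c[0]))
    return best[0]


# Illustrative data only: two candidates that sound alike.
candidates = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]
word_frequency = {
    "recognize": 0.001, "speech": 0.002,
    "wreck": 0.0001, "a": 0.05, "nice": 0.003, "beach": 0.0005,
}
print(pick_transcription(candidates, word_frequency))
```

In this made-up example the second candidate matches the audio slightly better, but the first wins once common usage is taken into account.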

Machine speech-to-text transcription accuracy varies depending on several factors:

Quality of the audio — Poor quality microphones, muffled sound and background noise will impact the ability to pick up and correctly identify the spoken word.
Multiple speakers — Transcripts of video or audio with multiple speakers will be less accurate if individuals talk over one another.
Accents — Strong dialects/accents may impact the ability to pick up or correctly identify some words during natural language processing.
Niche topics — Machine speech-to-text recognizes a wide range of words. However, specialized technical language and industry-specific terminology may not be automatically recognized.

Good-quality spoken English audio averages 80% accuracy or higher.
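This page does not state how accuracy is measured. One common way to score a transcript is word accuracy, calculated as 1 minus the word error rate (WER) against a trusted reference transcript. The sketch below assumes that metric; the function names and sample sentences are illustrative only.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Word accuracy = 1 - WER, floored at 0."""
    reference = reference_text.lower().split()
    hypothesis = hypothesis_text.lower().split()
    if not reference:
        raise ValueError("reference transcript is empty")
    wer = word_edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - wer)


reference = "the quick brown fox jumps over the lazy dog today"
machine = "the quick brown fox jumped over the lazy dog"
print(f"{word_accuracy(reference, machine):.0%}")
```

Here the machine transcript has one substituted word and one missing word against a ten-word reference, so two errors out of ten words.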

Zoom import transcripts are classified as Machine speech-to-text.

Video transcripts created by Machine speech-to-text transcription are labeled (Auto) and marked with a computer icon on the language tab, as shown in the image below.

LivingLens transcript with machine speech to text labels.

Note: Video channels are set to automatically process Media Capture and CaptureMe uploads with Machine speech-to-text for all Supported languages.

Speech-to-text transcription can be processed for custom dictionary find and replace and profanity masking. For more information, see Custom dictionary.
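As a rough illustration of what find and replace and profanity masking do to transcript text, the sketch below applies both with whole-word, case-insensitive matching. It is an assumption-based example, not the Custom dictionary implementation; the function names, the masking style, and the sample terms are hypothetical.

```python
import re


def apply_custom_dictionary(text, replacements):
    """Replace whole-word matches using a find -> replace mapping."""
    for find, replace in replacements.items():
        pattern = re.compile(r"\b" + re.escape(find) + r"\b", re.IGNORECASE)
        # A function is used so the replacement is inserted literally.
        text = pattern.sub(lambda _match, value=replace: value, text)
    return text


def mask_profanity(text, banned_words, mask_char="*"):
    """Keep the first letter of each banned word and mask the rest."""
    def mask(match):
        word = match.group(0)
        return word[0] + mask_char * (len(word) - 1)

    for word in banned_words:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        text = pattern.sub(mask, text)
    return text


transcript = "I love medalia video but the darn upload failed"
transcript = apply_custom_dictionary(transcript, {"medalia": "Medallia"})
transcript = mask_profanity(transcript, ["darn"])
print(transcript)
```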

Human transcription

Human transcription from Video's global network of approved suppliers produces caption-quality, time-stamped (.VTT or .SRT) transcripts of the spoken word in video and audio files. The API-enabled process returns the completed transcript orders to Video.
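For readers unfamiliar with the two subtitle formats, the sketch below renders the same time-stamped cues as SRT and as WebVTT. SRT numbers each cue and uses a comma before the milliseconds; WebVTT starts with a WEBVTT header line and uses a period. The helper names and the cue text are made up for illustration and do not represent files produced by Video.

```python
def format_timestamp(seconds, millis_separator):
    """Format seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{millis_separator}{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) cues as SRT: numbered cues, comma separator."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues):
    """Render (start, end, text) cues as WebVTT: header line, period separator."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


cues = [(0.0, 2.5, "Hello and welcome."), (2.5, 5.0, "Thanks for joining.")]
print(to_srt(cues))
print(to_vtt(cues))
```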

Restriction: Human transcription is not supported on Video Experience Edition. Contact your Medallia representative for more information.

Important: Human transcription incurs additional cost. Contact your Medallia representative for pricing information.

Video transcripts created by Human transcription are labeled with a speech icon, as shown in the image below.

LivingLens transcript with the Human transcription label.

Human transcription can be processed for custom dictionary find and replace and profanity masking. For more information, see Custom dictionary.

Upgrading Machine speech-to-text to Human transcription

Role required: Customer Admin, Channel Admin, or Pro. See Video user roles.

You can upgrade machine transcription to Human transcription. If Human transcription is included in a Medallia Video subscription, costs incurred are billed monthly in arrears. Contact your Medallia representative for pricing information.

For more information, see Ordering a transcription.

Speaker separation

Speaker separation is a transcription solution that identifies the number of speakers in a video or audio file and displays each numbered speaker in the transcript. Speakers are numbered in the platform transcript field and in the Data export transcript fields. Speakers are named in human transcription when possible.

Speaker separation is available alongside:

Machine speech-to-text transcription for several languages. For more information, see Supported languages.
English Human transcription.

The following example shows a Speaker separation transcript where two speakers are identified and numbered as 1 (speaker 1) and 2 (speaker 2).

A transcript field with Speaker separation identifying two speakers.
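To make the idea of numbered speakers concrete, the sketch below takes transcript segments tagged with a speaker number, merges consecutive segments from the same speaker, and prints one line per turn. The segment structure and the "1: text" layout are assumptions for illustration and may differ from the platform transcript field or the Data export format.

```python
from itertools import groupby


def render_speaker_transcript(segments):
    """Merge consecutive segments by speaker and label each turn with its number.

    segments: list of (speaker_number, text) pairs in spoken order.
    """
    lines = []
    for speaker, group in groupby(segments, key=lambda segment: segment[0]):
        text = " ".join(segment_text for _, segment_text in group)
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)


def count_speakers(segments):
    """Number of distinct speakers identified in the segments."""
    return len({speaker for speaker, _ in segments})


# Hypothetical two-speaker conversation.
segments = [
    (1, "Hi, thanks for joining."),
    (1, "How was your visit?"),
    (2, "It was great."),
    (1, "Glad to hear it."),
]
print(count_speakers(segments))
print(render_speaker_transcript(segments))
```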

Best practice is to use Speaker separation when media contains two or more speakers.

Speaker separation quality and performance depend on the audio quality. Unclear audio, overlapping speakers, or background noise can cause inaccurate speaker identification and numbering.

Restriction: Speaker separation is not available for Zoom-produced transcripts or for uploaded .SRT or .VTT subtitle files.