Automatic Speech Recognition

Use Automatic Speech Recognition (ASR) to quickly and accurately convert audio and video conversations to text, either in real time or asynchronously. You can transcribe conversations in multiple languages and customize the ASR for domain-specific keywords using custom vocabulary.

For more information on how to process conversations with ASR, see Process a conversation.

Use cases

A few of the many use cases for ASR:

  • Closed Captioning – Add closed captions to recordings and live audio and video streams. Increase accessibility to products and comply with regulatory standards that require closed captions and record keeping.
  • Searchable Media Library – Enable text-based search for audio and video content using transcripts.
  • Search Engine Optimization (SEO) – Improve SEO by using transcripts to tag audio and video content.
  • Transcription – Save time and money by automatically transcribing conversations. Apply ASR transcription to situations in which it is costly to hire human transcribers.

Features

ASR includes out-of-the-box support for the following features:

  • Formatting:
    • Redact PII, PCI, and PHI data – Choose to redact one or all categories of PII, PCI, and PHI data from conversations.
    • Redact profanity – Automatically redact profane phrases from conversations.
    • Remove filler words – Remove small meaningless words such as um, ah, and hmm from conversations.
    • Punctuation – Improve readability of transcripts by adding all discernible punctuation, including questions, pauses, and full stops.
    • Inverse Text Normalization – Convert spoken numeric expressions such as dates, times, addresses, and currency amounts from words to numerals. For example, “one thousand five-hundred” becomes 1,500 and “one hundred and twenty-five dollars” becomes $125.00. Also known as numerical formatting.
    • Transcription output format – The default output format is JSON. You can convert this to Markdown or SubRip Subtitle (SRT) format.
  • Multi-language support – Get transcription for any of the supported languages.
  • Speaker separation:
    • Automatic speaker diarization – Automatically detect speakers and assign each message in the transcript to the right speaker.
    • Multiple channels – Get transcription for each channel separately when you process a recorded conversation file with multiple channels.
  • Custom vocabulary (Keyword Boosting) – Bias the ASR to recognize particular domain-specific terms that the general model would otherwise not detect. For example, you can provide people's names and brand names that are specific to your business.
  • Speaker analytics:
    • Pace – The speed at which the person spoke, in words per minute (WPM). Also called speaker speech speed.
    • Silence time – The time during which none of the speakers said anything.
    • Speaker overlap – Whether a speaker spoke over another speaker, reported as a percentage of the total conversation and as overlap time in seconds.
    • Speaker ratio – The total ratio of talk time for one speaker compared to others in the same conversation.
    • Talk time – The amount of time each person spoke during the conversation.
  • Bookmarks – Highlight and summarize key moments from conversations. Quickly get to key moments of a conversation and share those moments with others.
  • Real-time interim results – Intermediate transcript predictions returned during a real-time conversation. They are likely to change before the ASR returns its final results.
  • Confidence score – An estimate of the reliability of a detected word and message.
  • Word-level timestamps – The start and end time of each word in Coordinated Universal Time (UTC) format.
  • Sentence-level timestamps – The start and end time of each sentence in Coordinated Universal Time (UTC) format.
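The speaker analytics listed above can all be derived from per-message speaker labels and timestamps. The following is a minimal sketch of that arithmetic, not the service's implementation. The `Message` type and the function names are hypothetical, and timestamps are assumed to be seconds from the start of the conversation rather than UTC values. Overlap is summed pairwise between messages from different speakers, which is exact only when no more than two speakers talk at once.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical shape for one transcript message.
    speaker: str
    text: str
    start: float  # seconds from the start of the conversation (assumption)
    end: float


def talk_time(messages):
    """Total seconds each speaker spoke."""
    totals = {}
    for m in messages:
        totals[m.speaker] = totals.get(m.speaker, 0.0) + (m.end - m.start)
    return totals


def pace_wpm(messages):
    """Words per minute for each speaker, measured over that speaker's talk time."""
    words = {}
    for m in messages:
        words[m.speaker] = words.get(m.speaker, 0) + len(m.text.split())
    return {s: words[s] * 60.0 / t for s, t in talk_time(messages).items() if t > 0}


def silence_time(messages, duration):
    """Seconds of the conversation during which no speaker was talking."""
    covered = 0.0
    cursor = 0.0  # end of the speech interval covered so far
    for m in sorted(messages, key=lambda m: m.start):
        start = max(m.start, cursor)
        if m.end > start:
            covered += m.end - start
            cursor = m.end
    return duration - covered


def overlap_time(messages):
    """Seconds during which messages from different speakers overlap."""
    total = 0.0
    for i, a in enumerate(messages):
        for b in messages[i + 1:]:
            if a.speaker != b.speaker:
                total += max(0.0, min(a.end, b.end) - max(a.start, b.start))
    return total


example = [
    Message("A", "hello there how are you", 0.0, 2.0),
    Message("B", "fine thanks", 1.5, 2.5),
    Message("A", "great", 4.0, 5.0),
]
print(talk_time(example))
print(pace_wpm(example))
print(silence_time(example, duration=6.0))
print(overlap_time(example))
```

Dividing a speaker's talk time by the other speakers' combined talk time gives a speaker ratio, and dividing the overlap time by the conversation duration gives the overlap percentage.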

Feature availability cross-reference

Different features are available for recorded (async) and real-time (streaming and telephony) conversations:

  • Redact PII, PCI, and PHI data
  • Redact profanity
  • Remove filler words
  • Numerical formatting
  • Automatic speaker diarization
  • Multi-channel support – N/A for streaming and telephony.
  • Custom vocabulary (keyword boosting)
  • Silence time
  • Speaker overlap
  • Speaker ratio
  • Talk time
  • Real-time interim results – N/A for async.
  • Confidence score
  • Word-level timestamps
  • Sentence-level timestamps
  • Closed captions (SRT)
  • Multi-language support – Async: see the list of supported languages. Streaming: en-US only. Telephony: en-US only.
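Closed captions in SubRip Subtitle (SRT) format pair each caption with a numbered cue and a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, which maps naturally onto sentence-level timestamps. The sketch below shows that mapping under stated assumptions: the input is a list of dictionaries with hypothetical `start`, `end`, and `text` keys, and the times are seconds from the start of the media rather than UTC values. It illustrates the format only and is not the service's own conversion.

```python
def srt_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS,mmm, the form SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(sentences):
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text' keys."""
    cues = []
    for index, s in enumerate(sentences, start=1):
        time_range = f"{srt_timestamp(s['start'])} --> {srt_timestamp(s['end'])}"
        cues.append(f"{index}\n{time_range}\n{s['text']}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


sentences = [
    {"start": 0.0, "end": 1.25, "text": "Hello."},
    {"start": 1.5, "end": 3.0, "text": "Welcome."},
]
print(to_srt(sentences))
```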

Supported languages for speech-to-text transcription

Multi-language support for recorded (async) conversations applies to speech-to-text transcription.

Supported language            Code
English (United States)       en-US
English (United Kingdom)      en-GB
English (Australia)           en-AU
English (Ireland)             en-IE
English (India)               en-IN
English (South Africa)        en-ZA
English (New Zealand)         en-NZ
Russian (Russian Federation)  ru-RU
French (Canada)               fr-CA
French (France)               fr-FR
French (Luxembourg)           fr-LU
French (Switzerland)          fr-CH
German (Germany)              de-DE
German (Austria)              de-AT
German (Belgium)              de-BE
German (Luxembourg)           de-LU
German (Switzerland)          de-CH
Italian (Italy)               it-IT
Italian (Switzerland)         it-CH
Dutch (Netherlands)           nl-NL
Japanese (Japan)              ja-JP
Spanish (United States)       es-US
Spanish (Spain)               es-ES
Arabic (Saudi Arabia)         ar-SA
Hindi (India)                 hi-IN
Portuguese (Brazil)           pt-BR
Portuguese (Portugal)         pt-PT
Persian (Iran)                fa-IR
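Because language support differs by conversation type, it can help to check a language code on the client side before submitting a conversation. The sketch below assumes the reading of the availability list given earlier: every code in the table for async, and en-US only for streaming and telephony. The mode labels and the `is_supported` helper are hypothetical names used for illustration and are not part of any API.

```python
# Language codes copied from the table above (async speech-to-text).
ASYNC_LANGUAGES = {
    "en-US", "en-GB", "en-AU", "en-IE", "en-IN", "en-ZA", "en-NZ",
    "ru-RU",
    "fr-CA", "fr-FR", "fr-LU", "fr-CH",
    "de-DE", "de-AT", "de-BE", "de-LU", "de-CH",
    "it-IT", "it-CH",
    "nl-NL", "ja-JP",
    "es-US", "es-ES",
    "ar-SA", "hi-IN",
    "pt-BR", "pt-PT",
    "fa-IR",
}

# Assumption: streaming and telephony accept en-US only.
REALTIME_LANGUAGES = {"en-US"}


def is_supported(language_code, mode):
    """Return True if the language code is listed for the given mode.

    mode is one of the hypothetical labels "async", "streaming", "telephony".
    """
    if mode == "async":
        return language_code in ASYNC_LANGUAGES
    if mode in ("streaming", "telephony"):
        return language_code in REALTIME_LANGUAGES
    raise ValueError(f"unknown mode: {mode!r}")


print(is_supported("fr-CA", "async"))
print(is_supported("fr-CA", "streaming"))
```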

Conversation intelligence

In addition to ASR-generated transcripts, you can get conversation intelligence that surfaces actionable insights from your conversations. After processing a conversation, you can use our Conversations API to get a wide range of conversation intelligence.

Next steps

  • Get started – Quickly find and start using the tools and technologies that meet your needs.
  • Process a conversation – Submit an async, streaming, or telephony conversation to receive a conversation ID.
  • Get messages – With a conversation ID, you can generate a transcript including your choice of insights.
  • Conversation intelligence – Use your completed transcript to access all conversation intelligence features.

Further reading