Ever Wondered: How does speech-to-text software work?

Love it or hate it, you can't avoid it.

You use it all the time, but do you know how it really works? Image by Alexandra Ossola

From Siri to sales calls, no one can avoid the blessings and curses of voice recognition software. There are lots of unconventional uses for this technology, including better documentation of patients’ medical histories and even flying planes. But one interesting and practical use for voice recognition is digital dictation: turning spoken words into written text. This is convenient not only for everyday smartphone users, but also for people with learning disabilities (like dyslexia) or who have more trouble writing than speaking. Unlike the frequently frustrating autocorrect function for typed text, speech-to-text software can be up to 99 percent accurate.

Let’s say you want to send a text message to your mom using your smartphone’s speech-to-text software. You’ve already tapped Compose and hit the little microphone button in anticipation of speaking into your phone. There are two crucial elements that you need in order to use your voice recognition software: a working microphone that can pick up your speech and a working Internet connection. Because smartphones are small and have limited space for software, much of the speech-to-text process is conducted on a remote server. When you speak the words of your message into the microphone, your phone sends the bits of data your spoken words created to a central server, which hosts the appropriate software and its corresponding database.

When the data arrives at the server, the software can analyze your speech. Programming-wise, this is the tricky part: The software breaks your speech down into tiny, recognizable parts called phonemes; English has only about 44 of them. It’s the order, combination and context of these phonemes that allow the sophisticated audio analysis software to figure out what exactly you’re saying, like the bread, cheese and sauce that differentiate a pizza from a calzone or a sandwich. For words that are pronounced the same way, such as eight and ate, the software analyzes the context and syntax of the sentence to figure out the best text match for the word you spoke.
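To make the eight/ate example concrete, here is a minimal sketch in Python of how context might settle a homophone. It is a toy illustration, not how any particular product works: the word-pair counts, the ARPAbet-style phoneme label "EY T" and the function name pick_word are all assumptions made up for this example.

```python
# Toy counts of how often a candidate word follows a given previous word.
# These numbers are invented for illustration, not taken from real data.
BIGRAM_COUNTS = {
    ("i", "ate"): 50,
    ("i", "eight"): 1,
    ("at", "eight"): 40,
    ("at", "ate"): 1,
}

# One phoneme sequence can map to several spellings (homophones).
HOMOPHONES = {"EY T": ["eight", "ate"]}


def pick_word(previous_word, phonemes):
    """Return the spelling that most often follows previous_word."""
    candidates = HOMOPHONES[phonemes]
    return max(
        candidates,
        key=lambda word: BIGRAM_COUNTS.get((previous_word.lower(), word), 0),
    )


print(pick_word("I", "EY T"))   # context "I ..." favors "ate"
print(pick_word("at", "EY T"))  # context "at ..." favors "eight"
```

Real systems weigh far more context than the single preceding word, but the idea is the same: the sounds alone are ambiguous, so the surrounding words break the tie.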

The software then searches its database for the text that best matches the words you spoke. Before the software was up and running, its programmers spent many hours connecting the distinct patterns of speech waves that certain words create with the written text of those words. It’s this background that the software draws from when it decides which written words to transmit back to your phone, where they appear on the screen in the text message composition form. Apple’s software for iPhone covers dictation capabilities for eight languages and their dialects (British, American and Australian English are all listed separately, for example).
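The matching step can be pictured as a lookup against stored pronunciations. The sketch below is a deliberate simplification: it compares lists of phoneme labels rather than raw speech waves, and the tiny PRONUNCIATIONS table, its ARPAbet-style labels and the best_match function are hypothetical stand-ins for the much larger database described above.

```python
from difflib import SequenceMatcher

# Hypothetical mini-database: written word -> phoneme sequence.
PRONUNCIATIONS = {
    "mom": ["M", "AA", "M"],
    "stop": ["S", "T", "AA", "P"],
    "cat": ["K", "AE", "T"],
    "food": ["F", "UW", "D"],
}


def best_match(heard):
    """Return the stored word whose phonemes are most similar to `heard`."""
    def similarity(word):
        return SequenceMatcher(None, heard, PRONUNCIATIONS[word]).ratio()

    return max(PRONUNCIATIONS, key=similarity)


print(best_match(["K", "AE", "T"]))       # an exact match for "cat"
print(best_match(["S", "T", "AA", "B"]))  # a slightly misheard "stop"
```

Picking the closest entry rather than demanding an exact match is what lets a system like this tolerate a mumbled or noisy final sound.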

All this happens in an instant. No sooner have you spoken the words, “Mom, stop feeding human food to my cat,” than you’re pressing the send button on a text message with the same words. You mentally thank the speech-to-text programmers who made this possible, even if your cat doesn’t necessarily thank you for the intervention.


  1. Another interesting use of speech-to-text software is checking and testing
    human pronunciation. At the moment, only one tool, https://speechpad.pw/prononce.php, can help in this area.

    Alex, August 15, 2014 at 3:50 pm