Ever Wondered: How does speech-to-text software work?

Love it or hate it, you can't avoid it.

August 15, 2014
You use it all the time, but do you know how it really works? Image by Alexandra Ossola

From Siri to sales calls, no one can avoid the blessings and curses of voice recognition software. There are lots of unconventional uses for this technology, including better documentation of patients’ medical histories and even flying planes. But one interesting and practical use for voice recognition is digital dictation: turning spoken words into written text. This is convenient not only for everyday smartphone users, but also for people with learning disabilities (like dyslexia) or anyone who has more trouble writing than speaking. Unlike the frequently frustrating autocorrect function for typed text, speech-to-text software can be up to 99 percent accurate.

Let’s say you want to send a text message to your mom using your smartphone’s speech-to-text software. You’ve already tapped Compose and hit the little microphone button in anticipation of speaking into your phone. There are two crucial elements that you need in order to use your voice recognition software: a working microphone that can pick up your speech and a working Internet connection. Because smartphones are small and have limited space for software, much of the speech-to-text process is conducted on a server. When you speak the words of your message into the microphone, your phone sends the audio data to a central server, which has access to the appropriate software and corresponding database.

When the data arrives at the server, the software can analyze your speech. Programming-wise, this is the tricky part: The software breaks your speech down into tiny, recognizable parts called phonemes. English has only about 44 of them. It’s the order, combination and context of these phonemes that allow the sophisticated audio analysis software to figure out what exactly you’re saying, like the bread, cheese and sauce that differentiate a pizza from a calzone or a sandwich. For words that are pronounced the same way, such as eight and ate, the software analyzes the context and syntax of the sentence to figure out the best text match for the word you spoke.
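
To make the eight-versus-ate problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny pronunciation table, the ARPAbet-style phoneme labels, and the four-sentence "corpus" are made up, and counting neighboring word pairs is just one simple way to use context, not necessarily what any real product does.

```python
from collections import Counter

# Hypothetical pronunciation table: phoneme sequence -> possible spellings.
# The phoneme labels are ARPAbet-style and chosen for illustration only.
PRONUNCIATIONS = {
    ("AY",): ["i"],
    ("HH", "AE", "V"): ["have"],
    ("EY", "T"): ["eight", "ate"],  # homophones: same sounds, different text
    ("K", "AE", "T", "S"): ["cats"],
    ("P", "IY", "T", "S", "AH"): ["pizza"],
}

# Made-up example sentences standing in for a large body of training text.
TOY_CORPUS = [
    "i ate pizza",
    "we ate lunch",
    "i have eight cats",
    "eight cats ate pizza",
]


def bigram_counts(sentences):
    """Count how often each pair of neighboring words appears."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        counts.update(zip(words, words[1:]))
    return counts


BIGRAMS = bigram_counts(TOY_CORPUS)


def transcribe(phoneme_groups):
    """Turn a list of phoneme sequences into text, using context for homophones."""
    candidates = [PRONUNCIATIONS[group] for group in phoneme_groups]
    # Words with a single spelling are settled immediately.
    words = [options[0] if len(options) == 1 else None for options in candidates]
    for i, options in enumerate(candidates):
        if words[i] is not None:
            continue
        previous_word = words[i - 1] if i > 0 else "<s>"
        next_word = words[i + 1] if i + 1 < len(words) else None
        # Prefer the spelling seen most often next to its neighbors.
        words[i] = max(
            options,
            key=lambda w: BIGRAMS[(previous_word, w)] + BIGRAMS[(w, next_word)],
        )
    return " ".join(words)


print(transcribe([("AY",), ("EY", "T"), ("P", "IY", "T", "S", "AH")]))
print(transcribe([("AY",), ("HH", "AE", "V"), ("EY", "T"), ("K", "AE", "T", "S")]))
```

The same two phonemes come out as "ate" in the first sentence and "eight" in the second, purely because of the words around them.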

The software then looks in its database for the text that best matches the words you spoke. Before the software was up and running, its programmers spent many hours connecting the distinct patterns of sound waves that certain words create with the written text of those words. It’s this background that the software draws from when it decides which written words to transmit back to your phone, where they appear on the screen in the text message composition form. Apple’s software for iPhone covers dictation capabilities for eight languages and their dialects (British, American and Australian English are all listed separately, for example).
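
The idea of matching an incoming sound pattern against stored examples can be sketched in a few lines of Python. This is a toy under loud assumptions: the three-number "fingerprints" below are invented, and picking the closest stored pattern is a deliberately simple stand-in for the statistical models real systems rely on.

```python
import math

# Hypothetical database: each word paired with a made-up acoustic fingerprint.
# Real systems derive far richer features from the recorded sound wave.
TEMPLATES = {
    "mom": [0.9, 0.1, 0.3],
    "cat": [0.2, 0.8, 0.5],
    "food": [0.4, 0.4, 0.9],
}


def best_match(features):
    """Return the stored word whose fingerprint is closest to the input."""
    return min(TEMPLATES, key=lambda word: math.dist(TEMPLATES[word], features))


# A slightly noisy recording of "mom" still lands nearest the "mom" template.
print(best_match([0.85, 0.15, 0.25]))
```

The point of the sketch is that the incoming pattern never has to match a stored one exactly; the software only needs to find the closest fit.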

All this in an instant. No sooner have you spoken the words, “Mom, stop feeding human food to my cat,” than you’re pressing the send button on a text message containing those same words. You mentally thank the speech-to-text programmers who made this possible, even if your cat doesn’t necessarily thank you for the intervention.

About the Author

Alexandra Ossola

Alexandra (Alex) Ossola earned her B.A. from Hamilton College with a concentration in Comparative Literature. Since graduating, she has served as a tutor and mentor with City Year in Washington, D.C. as well as planned and led high school travel programs to Latin America with Putney Student Travel. After dabbling in many different fields, she, like most curious people, was drawn to science. A lifelong lover of good communication, Alex writes about things she finds interesting, with topics that range widely.



Alex says:

Another interesting use of speech-to-text software is checking and testing
human pronunciation. At the moment, only one tool can help in this area.

Vincent Kenneth Hafford says:

If an internet connection is required for Talk to Text via the mobile phone, then why have I been able to use talk to text on my older phones that haven’t been connected to the internet in years? I mean, it acts like the phone says, “OK, there’s no internet, so I guess I’ll have to listen and type what you say…”
And no I didn’t have Wi-Fi turned on.
I only ask because I like to write poetry, and sometimes at home I’ll grab whichever device is handy ATM to jot down the ideas. And sometimes that’s an older phone, disconnected from the service plan.

Mishika Chawla says:

Not a direct answer, or even thoughts moving towards the answer.
