Native voice-to-text: Can you here mi noun?

I’m again experimenting with voice-to-text transcription. I keep hoping for that Star Trek experience — you know, where the characters whap their comm badge, speak with normal speed and diction, and have the computer fully understand them.

But for me, that’s truly fiction. In real life, with free-form text, voice-to-text accuracy is likely to range from meh to awful.

For example, I recently used voice-to-text in a jokey email to my wife in which I said:
“You may have missed it, lucky girl.”

Voice to text rendered it as:
“You may have missed it, like a cow.”

I’m glad I proofread that before I sent it!

(I bet my wife is, too!)

That kind of not-even-close transcription is all too common with voice to text. And even when the errors are less severe, they still require much back-pedaling, correction, and revision before your text is ready for anything but the most casual of uses.

On more than one occasion, I’ve dictated a paragraph or two into a voice-to-text application, but when I later tried to read the transcript, so many words were mangled that I couldn’t figure out what I was originally trying to say!

Still, it had been a while since I explored voice to text, and with all the current activity (OK Google, Siri, Alexa, Bixby, etc.) I thought it was time for a fresh look at the kinds of native voice-to-text being built into today’s devices and operating systems.

I wasn’t so much interested in exploring the pre-scripted commands that those apps can respond to — with a context-limited universe of words to choose from, most of these applications do pretty well.

You can, for example, say to an Android phone, “OK Google, navigate from here to Boston Common by foot.” Google will open Maps, find the best walking route from your current location to Boston Common (or wherever you specify), and launch turn-by-turn directions.

That’s cool, but it’s not free-form, flowing, natural speech. Transcribing true natural speech to text is a whole ’nother ballgame.

Want to see some real-life examples? Below, you’ll find four transcripts of voice-to-text output, as produced by four different combinations of hardware and software.

They’re all versions of the following sample paragraph that I wrote manually — fingers on keyboard. The sample paragraph also explains more of the setup:

First, I’ll read this paragraph using Google’s “voice typing” (the voice-to-text option built into the Gboard keyboard on my Android Samsung S8 phone). I’ll read it using the phone’s built-in mic, and then again using a brand new Bluetooth 4.1 headset optimized for dictation (condenser microphone on a boom). Next, I’ll read this paragraph using Windows 10’s built-in dictation function (Windows Key+H) both with my PC’s built-in microphone, and with the same Bluetooth headset used for the Android versions.

But before I show you the results, let me stipulate that none of these transcription apps handle punctuation very well. The Android app is supposed to recognize punctuation such as “open parentheses” and “close parentheses,” but it’s spotty at best. Windows 10 can understand that you mean a “;” when you say the word “semicolon,” but it falls down in areas such as recognizing that you want to start a new paragraph when you say “new paragraph.” (It inserts the phrase “new paragraph” into the body of your text.)

The poor handling of punctuation is a real problem unless you’re producing very simple texts. But it’s a whole different kind of trouble than, say, mistaking “lucky girl” for “like a cow!”

Now, back to the results: Up first, the nearly perfect output of Google’s Gboard voice typing on a Galaxy S8, with a good-quality dictation headset connected via Bluetooth 4.1, and with the phone connected to the internet via my office 5GHz Wi-Fi router.

As you can see, the punctuation and capitalization are funky, but the words themselves are OK.

First I’ll read this paragraph using Google’s voice typing The Voice to Text option built into the gboard keyboard on my Android Samsung S8 phone. I’ll read it using the phone’s built-in mic, and then again using a brand new Bluetooth 4.1 headset optimized for dictation condenser microphone on a boom. Next, I’ll read this paragraph using Windows 10 built-in dictation function Windows key + H both with my PCS built-in microphone, and with the same Bluetooth headset used for the Android versions.

Next up: A surprise — I also got the same nearly perfect results using the phone’s standard, built-in mic instead of the separate headset!

First, I’ll read this paragraph using Google’s voice typing The Voice to Text option built into the gboard keyboard on my Android Samsung S8 phone. I’ll read it using the phone’s built-in mic, and then again using a brand new Bluetooth 4.1 headset optimized for dictation condenser microphone on a boom. Next, I’ll read this paragraph using Windows 10 built-in dictation function Windows – key + H both with my PCS built-in microphone and with the same Bluetooth headset used for the Android versions.

Next, in the third example, you’ll see that Windows 10’s built-in dictation function (Windows Key+H) didn’t do as well. The PC itself shouldn’t have been a problem: it’s an SSD-based, 64-bit, 2.4GHz Core i7 system with a wired Ethernet connection to my office router. But here’s the result of reading the above sample paragraph, using the PC’s built-in microphone. I’ve highlighted the worst non-punctuation/verbiage errors in red.

First, I’ll read this paragraph using Google Voice typing Voice to text option built into the Jeep board keyboard on my Samsung S 8 phone. I’ll read it using the phone’s built in mic, and then again using a brand new Bluetooth 4.1 headset optimized for dictation condenser microphone on a boom. next, I’ll read this paragraph using Windows 10 built in dictation function windows key plus H both with my PC’s built in microphone, and with the same blue tooth headset use for Android versions.

I can understand mistaking Gboard for Jeep board, I guess, but shouldn’t Windows recognize the phrase “Windows Key”? And how can it not know Bluetooth?

The fourth and last test was very odd: Using my dictation headset (same as above) actually made Windows 10’s built-in dictation function a little worse! I have no idea why, because using a good headset usually improves voice recognition. Not this time.

First, out read this paragraph using Google’s voice typing voice to text option built into the Jeep board keyboard on my Android Samsung S 8 phone. I’ll read it using the phone’s built in mic, and then again using a brand new Bluetooth 4.1 headset optimized for dictation condenser microphone on a boom. Next, I’ll read this paragraph using windows tens built in dictation function windows key plus H both with my kisise built in microphone and with the same blue tooth headset used for Android version.

Along with the same errors mentioned above, this iteration mistook “I’ll” for “out” and turned “my PC’s” into “kisise.” That last is truly baffling: It’s not even an English word, and makes at least that part of the sentence wholly unintelligible.

That many fundamental verbiage errors, along with a dozen or so punctuation and capitalization errors, is, to me, unacceptable in so brief a paragraph (just 83 words). For me, it’s still much faster and cleaner to type.
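
A rough way to put a number on comparisons like these is word error rate: the count of word substitutions, insertions, and deletions needed to turn the transcript back into the original, divided by the number of words in the original. The Python sketch below is illustrative only. It was not used to score the tests above, the function names are made up for this example, and it deliberately ignores punctuation and capitalization so that only wrong words count.

```python
import re


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    # Treat curly and straight apostrophes the same so "PC’s" matches "PC's".
    normalized = text.lower().replace("’", "'")
    return re.findall(r"[a-z0-9']+", normalized)


def word_errors(reference, transcript):
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = words(reference), words(transcript)
    # prev[j] holds the distance between the reference words seen so far
    # and the first j words of the transcript.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from transcript
                            curr[j - 1] + 1,      # extra word in transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, transcript):
    """Word errors divided by the number of words in the reference."""
    ref_len = len(words(reference))
    return word_errors(reference, transcript) / ref_len if ref_len else 0.0


if __name__ == "__main__":
    said = "You may have missed it, lucky girl."
    heard = "You may have missed it, like a cow."
    print(word_errors(said, heard), "word errors")
```

Run on the “lucky girl” example from earlier, this counts 3 word errors against a 7-word original. Because hyphens are treated as word breaks, “built-in” and “built in” score the same, which fits the goal of separating real word errors from punctuation trouble.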

I love the convenience of voice-to-text (especially when there’s no keyboard, or only a virtual/on-screen keyboard available). And voice to text can work quite well indeed if you stick to scripted commands (words and phrases that fall within the context of what the software is already expecting), and use careful diction. But that’s not normal speech.

For natural, free-form, flowing speech, voice to text still isn’t quite ready for prime time.

Those Star Trek comm badges are going to have to wait!


Permalink: https://wp.me/paaiox-7N


2 Replies to “Native voice-to-text: Can you here mi noun?”

  1. Fred
    I wonder if the apps get better after multiple uses – in other words do they learn your speech patterns?
    I first got involved looking at this technology almost 40 years ago when we needed a cheap input method for a new product. We only needed numeric input, but had to settle for DTMF from the phone keypad.
    I did keep looking but all the systems required you to teach the system your voice – no good for a mass market product!

    1. The built-in Windows 10 dictation requires no specific training. The older stand-alone Windows speech-to-text app did let you train it. In an effort to improve the transcription quality, I did go through the training for the older version of the app, but I don’t know whether it carried over to the new version; there was no discernible difference before and after training. I think it’s simply that Google’s voice modeling is more accurate than the Windows 10 native one.
