
Why GPS voices are so condescending

John D. Sutter
Finding your way can be difficult if you think your GPS device sounds like a harpy. Voice technology aims to change that.
STORY HIGHLIGHTS
  • Computers still don't sound human, even after years of research
  • But technologists are making some strides
  • IBM is trying to make computer voices convey emotion
  • Professor says we attribute human characteristics to computer voices

(CNN) -- In this tech-saturated world, few things are more annoying than car navigation systems that yell at you for making a wrong turn.

"Re-CALC-ulating," the system says in that condescending robot voice, as if it is offended by having to rethink the route.

"Turn left at ... [sigh] ... recalculating ..."

Such interactions lead people to think GPS devices are nagging them, said Mark Gretton, chief technology officer of TomTom, a GPS maker.

"The main interaction you have with the device is a series of commands, so that starts the tone of the relationship right from the start," he said. "It's 'Do this, do that, turn right.' " And it doesn't help if the computer sounds snippy, he said.

Despite advances in "text to speech" technology, current computer voices can still be socially tone-deaf. Car systems are bossy. E-readers read aloud to us, but they don't know what they're reading, so Shakespeare can sound just like a monotone reading of a spreadsheet.

None of them can get intonations, pauses or emotional context quite right.

Farhad Manjoo, a tech columnist at Slate, compared the Amazon Kindle's reading voice, for example, to "Gilbert Gottfried laid up with a tuberculin cough" and "a dyslexic robot who spent his formative years in Eastern Europe."

So what gives? With more than a decade of voice research under our belts, why can't computers speak our language -- or at least sound a bit more human?

Well, they're trying, tech researchers say, but these machines face a striking number of technological hurdles in their efforts to sound un-robotic.

Complex speech patterns

The most obvious reason the computers have trouble is that human speech is almost infinitely complex. There are about 40 phonemes -- or basic sounds -- in the English language, but there are seemingly limitless combinations.
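For a rough sense of that scale, the short Python sketch below counts ordered phoneme sequences. It assumes the figure of about 40 phonemes quoted above and, unrealistically, that every ordering can occur; real English allows far fewer, so treat the numbers as an upper bound rather than a measurement. The function name is made up for illustration.

```python
# Rough upper bound on phoneme sequences. Assumes about 40 English phonemes
# (the figure quoted in the article) and that every ordering is possible,
# which overstates what actually occurs in spoken English.
PHONEME_COUNT = 40


def sequence_count(length: int, inventory: int = PHONEME_COUNT) -> int:
    """Return the number of ordered phoneme sequences of the given length."""
    return inventory ** length


for n in (1, 2, 3, 4):
    print(n, sequence_count(n))
```

Even at sequences of three sounds the count reaches 64,000, which helps explain why recording sessions need sentences built to cover as many combinations as possible.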

To try to get computers on the right track, voice technologists record human actors reading all kinds of wacky sentences, which are designed to elicit as many phoneme combinations as possible.

Computers store all these sentences in a database, chop them into sounds, and then remix them to make any possible combination of words.

The result is intelligible, but it's not quite human.
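The store, chop and remix process described above can be sketched in a few lines of Python. This is a toy illustration, not how IBM or TomTom build their voices: short strings stand in for audio clips, and the phoneme labels, clip names and function names are all hypothetical.

```python
# Toy concatenative synthesis: strings stand in for recorded audio clips.
# The phoneme labels and clip names below are hypothetical placeholders.

def build_unit_database(recordings):
    """Chop each recorded utterance into units keyed by phoneme label."""
    database = {}
    for utterance in recordings:
        for label, clip in utterance:
            # Keep only the first clip seen for each phoneme.
            database.setdefault(label, clip)
    return database


def synthesize(phonemes, database):
    """Remix stored units into a new utterance by joining them in order."""
    missing = [p for p in phonemes if p not in database]
    if missing:
        raise KeyError(f"no recorded unit for: {missing}")
    return "+".join(database[p] for p in phonemes)


recordings = [
    [("K", "k_01"), ("AE", "ae_01"), ("T", "t_01")],  # actor reads "cat"
    [("D", "d_01"), ("AO", "ao_01"), ("G", "g_01")],  # actor reads "dog"
]
db = build_unit_database(recordings)

# "tag" was never recorded as a word, but its sounds were.
print(synthesize(["T", "AE", "G"], db))
```

Joining clips end to end with no smoothing at the seams is one reason a remix like this would sound choppy, and a production system would presumably choose among many recorded variants of each sound rather than keeping just one.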

A super-high-quality computer voice might require 40 hours of voice recordings in order to sound nearly human, said Andy Aaron, a computer speech researcher at IBM.

That's just for one voice, one accent.

Computing power

Aaron said computers that have lots of voice data to pull from can sound, at times, nearly human.

But the issue is that not every computer has an entire server farm waiting to process every sentence it would like to say.

Mobile phones and GPS devices, in particular, just don't have enough computing power or storage space to thumb through mountains of voice files in order to sound as realistic as possible with current technology.

The result: Corners are cut in the name of workability, and some of the nuances of the spoken language are lost, said Gretton of TomTom.

This will improve as computers continue to get faster and store more data, he said.

Parts of computer voices are also generated entirely from equations and models, not actor-read sounds.

Those bits act as filler, and cut down on database sizes, too.

Speak thy heart

Another major problem for talking computers is that it's somewhat difficult for them to replicate the sound of human emotion and inflection.

This, however, is a major topic of speech research, and the technology appears to have made some strides. People who record the sentences that are the grist for computer speech sometimes are asked to read in different emotional states. Computers can pull from these sounds if they want to flip the pitch of a computer voice up at the end of a sentence, for example, in order to ask a question. Or they pull from higher frequencies to sound happy or excited.

IBM Research has posted a demo of this on its website in order to show the differences between emotive and robotic computer voices.

Take this example sentence:

"These cookies are delicious."

Listen to that sentence as read by a computer with no emotion.

Here it is again, spoken by a computer using a system called Naxpres, which tries to take emotional cues into account. Notice that the voice perks up at the end, as if the computer is saying the cookies are "de-LISH-ious."

It makes some difference.

Emotional context

But copying the sounds of human emotion is only half the battle. To really make computers sound more human, the machines have to understand what they're reading -- at least to some degree -- so that they know when to inflect.

This part of computer science is much more challenging, said Aaron.

Consider another sample sentence:

"I say tomato, and you say tomato."

Most people would have heard that line before, and would automatically pronounce "tomato" as "to-mah-toe" the second time, said Aaron, of IBM.

But not a computer.

"How would the computer know that those two words are supposed to be pronounced differently?" Aaron said. "It's only real-world knowledge that can tell the computer that those two words are supposed to be pronounced differently."

The same applies to emotions and inflections. It's difficult for a computer to know how to read a passage of text, and what emotions should apply.

"If you read a passage to somebody, you're obviously going to read it a way that does justice to the content," said Vlad Sejnoha, chief technology officer of Nuance, a company that develops speech technologies.

"If you're reading a technical report, you're probably not going to read it in a way that's much different from a computer, but if you're reading a poem, it's a different kettle of fish," he said. "You're really trying to communicate a lot of emotional meaning through the pauses you introduce and through the pacing and such. That really requires a pretty deep understanding" of language.

'You want to punch them'

As it turns out, the best computer voices may be those that sound exactly like the person who's listening. If a computer voice matches your mood, your speech patterns, your accent and your tonal range, you're less likely to be annoyed by it, researchers said.

How well a computer voice matches the listener's mood is not just a matter of preference -- it's a matter of safety, said Clifford Nass, a Stanford professor who studies computer voices.

In a 2005 study, Nass found that these emotional mismatches may actually be dangerous in driving situations. Sad drivers who get instructions from happy computer voices -- and happy drivers who listen to sad voices -- are more likely to have accidents, he said. The emotionally confused drivers are also less likely to be able to pay attention to the road.

So, if you're having a groggy sort of morning, instructions from a GPS device that sounds like a caffeinated cheerleader might just push you over the edge.

"If you think about it, when you're happy, you want to be around happy people. But if you're sad, do you really want to hang around chirpy, happy people saying, 'Let's turn that frown upside down?'" he said. "No. You want to punch them."

Sejnoha, from Nuance, said his company has developed a prototype computer voice system that listens to a person speak and then tries to mimic it.

Gretton, from TomTom, said his company hasn't looked into matching drivers' emotions to the voices of their navigation systems yet.

But one interim solution, he said, gives drivers many options when it comes to the voices of their computerized companions.

TomTom offers a range of downloadable voices -- from the fictional Darth Vader and Homer Simpson to celebrities like the rapper Snoop Dogg.

Users can also read a set of test sentences and have their own voices transferred into the GPS -- so that they're, in effect, bossing themselves around.

Perhaps it's a little less tempting to yell at the computer if the computer sounds exactly like you do -- or as close as technology allows.
