It could be any ad on YouTube: A blonde model playfully puts her hand in front of the camera lens, dons white sunglasses and flashes a grin. In the background, hip-hop music plays while an unmistakably female voice says, “Fashion changes, but style lasts forever.”
The ad — part of a demo reel on YouTube created by a new startup called WellSaid Labs — is short and slick. But something is a bit different. While the model you see is a human, the background voice you hear only sounds like one.
The Seattle-based company is using voice actors and artificial intelligence to create synthetic voices that sound a heck of a lot like people. The company claims the text-to-speech software it has been working on for the past year can produce audio that sounds more human-like than other synthetic voices. The reason, according to the company, is that it is not tightly controlling different variables of speech like speed, pronunciation, and volume when training its voice model.
“The voice we’re trying to create here is super expressive and lifelike in its final result,” WellSaid Labs CEO Matt Hocking told CNN Business.
Computerized voices seem to be everywhere these days, offering news from a smart speaker in your living room or giving you turn-by-turn directions in the car. Yet Alexa, Siri, Google Assistant and others that you’re likely to hear from still tend to speak in stilted, robot-tinged voices. (A notable exception, Google Duplex, can call some businesses to make reservations with an impressively human-sounding AI-enabled voice; Google is making it increasingly available, but you’d have to be on the receiving end of a phone call — at a restaurant, for instance — to hear it.)
WellSaid Labs isn’t planning to take over the voice-assistant market, though. Rather, Hocking said, it hopes to sell the voices to companies that want to use them in advertising, marketing and e-learning courses.
The company says it’s building a number of human-like voices that customers will be able to use, and hopes to work with voice actors to create different data sets that can be used to create all kinds of artificial voices.
You’ve probably heard of stock photos; you might think of this as stock voices.
To make the woman’s voice in the faux ads, WellSaid Labs first had a voice actor read articles from Wikipedia. These recordings formed a data set that it used to train an artificial neural network — a computing system whose structure is modeled loosely after neurons in a brain.
Another online demo shows how similar the AI-generated voices can sound to the actors, with audio alternating between two almost indistinguishable voices — one the human voice-over actor, one her AI-generated voice — that sound like a middle-aged woman. You might occasionally notice some differences, but they’re slight; the emphasis you’d expect might be off by just a bit in a word, for instance.
The startup said it doesn’t need to pre-process or annotate text given to the software for it to be able to do things like emphasize words in a natural-sounding way — something that is difficult for an artificial voice to do without help (though companies such as Google have been working on it). And if you fed the same text to its text-to-speech generator twice, you’d get different results.
It takes about four seconds to render a line of text right now, said chief technology officer Michael Petrochuk. The model isn’t built to interpret long pieces of text, though: it can be used to speak several sentences, but the text of an entire CNN Business article, for example, would need to be cut into pieces before it could be analyzed and spoken by a WellSaid Labs voice. (The company made one of its voices speak the headline and first paragraph of this story — take a listen and see what you think.)
It’s hard to make a synthetic voice sound consistently good. Alan Black, a professor of language technologies at Carnegie Mellon University, said that the ones we’re familiar with, such as Amazon’s Alexa, are robotic-sounding because it’s tricky to make them sound natural in all situations. It’s difficult, he said, to give the right amount of information to a speech synthesizer so it can respond with the right amount of feeling.
“We don’t have a little knob on our synthesizer to say ‘Do feeling 87%,’” he said.
He listened to some of WellSaid Labs’ demo voices, and thought they sounded “pretty good.”
But if artificial voices sound close to — or indistinguishable from — humans, should listeners be clued in that they’re not listening to a real person talk? After Google demonstrated Duplex in 2018 with a call that its human-sounding AI made to a Bay Area restaurant, the tech company was criticized for not having the AI disclose what it was.
Black doesn’t think that disclosure is necessary, at least in the context of ads.
“I think that in general most people are relatively aware that what they see in video and audio is in some sense processed,” he said. “They know that when they’re watching ‘The Lord of The Rings’ there really aren’t a lot of orcs in New Zealand appearing in the movie.”