Communicating with machines: a story that is no longer far away

The story of how scientists are building talking computers whose voices are the most natural, human-like sounds we have ever heard coming from machines.

GM Voices' studio sits tucked away on a road in Alpharetta, Georgia, a wealthy suburb of Atlanta. Its bread-and-butter work is recording narration for corporate instructional videos, prompts for voicemail and electronic notification systems, and the like. It doesn't sound glamorous, but for voice actors it is steady work. September Day is one of those actors, and one morning in 2011 she came here to start work on an unusual project.

[Image: voice recognition technology in use in some high-end car models]

Day, a 37-year-old mother who has voiced work for well-known clients such as MTV, Domino's Pizza, and Nickelodeon, had been told only a few details about the job. She knew she had been hired for a "text-to-speech" (TTS) project: an effort to help computers read text aloud the way humans do.

Day went to work confidently, even though she had given birth only four days earlier, but she was not prepared for what was about to happen to her.

Ivona, a Polish company specializing in TTS, was creating an electronic voice to be built into the Kindle Fire, Amazon's tablet. When Kindle Fire users switch on a certain setting, they can hear "Salli" read to them.

For eight days, six to seven hours a day, Day read from texts such as "Alice in Wonderland", AP news articles, and sometimes isolated sentences, all while sitting nearly motionless. She had to read hundreds of numbers with different intonations: "One! One. One? Two! Two. Two?"

"I am like Ironman (Ironman)!" - Day shared - "I have never experienced anything like this! I used to be the queen of TV shows 30-60 seconds. That's the place for me . " By the fourth day Day had to take a leave, because her throat was hoarse. But then Day regained his courage, and returned to voice many documents.


Stories like Day's are becoming more common as talking devices proliferate. No longer a novelty, Siri, GPS navigation systems, and text-to-speech apps are everywhere, and the need for them is easy to see: while driving, you can't type a Google search, so you ask your phone to find the nearest Starbucks.

At the gym, an RSS reader can read you the latest financial news aloud. Google, Apple, Microsoft, and even Amazon have all invested heavily in this area, and many believe we are seeing the beginning of conversations with machines, literally.

Today, the voices that speak to us from phones and cars sound human because they are human, or at least they once were.

Behind every line Siri speaks, an actor sat in a recording studio. Once that person's work is done, they can go home, but their recording has only begun its journey, through a series of technological processes that would have been impossible a decade ago. It is also the story of our innate desire to form relationships with everything around us, even inanimate objects.

J. Brant Ward is the director of voice design and development at Nuance. He has worked in Silicon Valley for more than a decade, composing music for synthesizers and scripting speech for synthesized voices.

Nuance is one of the world's largest speech recognition and text-to-speech companies. (Speech recognition is roughly the reverse of text-to-speech: the computer translates your spoken words into text.) Among many other things, it provides voice dictation systems for the medical industry, and it puts its recognition and text-to-speech technology into everything from tablets to cars.


Ward and the head of the company's design team, David Vazquez, are part of a group in Nuance's Sunnyvale, California office that creates a new generation of synthesized voices. They describe the work as "a combination of art and science".

The text-to-speech industry is highly competitive and highly secretive. Although it is widely reported that Nuance created the voices behind Siri, Ward and Vazquez deflect when asked about it.

So how do they create such a voice? Obviously, they don't do it by recording every word in the dictionary. But think about an application that can read aloud any new message that appears in your feed, or any search result it finds for you: it does need to be able to say every word in the dictionary.

"Ask if you want to know the address of the nearest flower shop" - Ward said - "There are 27 million stores across the US. We can't record the names of each one."


"It's time to find a shortcut," Vazquez said, pulling out a document. It was like the script of Hamlet, but typed in an Excel style with strange sentences such as: Scratching the collar of my neck, where humans once had gills.

Most of the sentences were chosen, Vazquez explained, because they are "phonetically rich": they pack in many different speech sounds. Some read like tongue twisters designed to trip up the speaker.

"The problem is, the more data we have, the more we can make the voice come alive," Ward said. These sentences, though meaningless, contain a lot of information.

Once the script has been recorded in an actor's voice, a process begins that can last for months. The words and sentences are analyzed, classified, and tagged in a database, a complex task requiring a team of linguists as well as specialized software.

When the process is complete, Nuance's text-to-speech software can pull out the right sounds and combine them into words and phrases the actor may never have uttered. It still sounds very much like the actor talking, because technically it is the actor's voice.

The formal name for that process is "unit selection", a form of concatenative speech synthesis. Ward compares it to "writing a ransom note": pieces of speech are cut out and pasted together into sentences, a radical simplification of how we actually produce language.
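
The "ransom note" can be caricatured in a few lines of Python. In this deliberately simplified sketch, each phoneme has several recorded candidate units described only by a pitch value, and we pick the chain that joins most smoothly. The unit IDs and pitch numbers are invented, and real unit selection weighs many more features (duration, energy, spectral continuity) with dynamic programming rather than a single greedy pass.

```python
# Each entry: phoneme -> candidate units taken from different recordings,
# described here only by a pitch in Hz (all values made up for illustration).
UNITS = {
    "n":  [{"id": "n_012",  "pitch": 118}, {"id": "n_340",  "pitch": 141}],
    "ih": [{"id": "ih_077", "pitch": 120}, {"id": "ih_205", "pitch": 160}],
    "r":  [{"id": "r_033",  "pitch": 124}, {"id": "r_410",  "pitch": 150}],
}

def select_units(phonemes):
    """Greedy join-cost minimisation: pick each unit so its pitch sits as
    close as possible to the previously chosen unit's pitch."""
    chosen, prev_pitch = [], None
    for ph in phonemes:
        candidates = UNITS[ph]
        if prev_pitch is None:
            best = candidates[0]
        else:
            best = min(candidates, key=lambda u: abs(u["pitch"] - prev_pitch))
        chosen.append(best["id"])
        prev_pitch = best["pitch"]
    return chosen

# "Synthesising" a nonsense phoneme string: the result is the list of unit
# IDs that would then be spliced together from the actor's recordings.
print(select_units(["n", "ih", "r"]))
```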

As humans, we learn to speak before we learn to write. Speaking is largely unconscious: we do it without thinking about what we're doing, and we certainly don't think about the fluctuations in tone, pitch, and speed, or the linking of sounds, that let us continuously and efficiently convey complex ideas and emotions. For a computer to do the same, all of these factors have to be accounted for explicitly, a job one linguistics professor describes as extraordinarily hard.

Take the example of "A" sound in the word "cat". It will be pronounced with a slight difference when placed in the word "catty" or "alligator".

Sentence structure brings its own challenges. "If you ask, 'Are you going to San Francisco or New York?', the pitch rises at the end of the sentence," Vazquez said. But if it's a question offering a list of choices, like "San Francisco, Philly, or New York?", then "York" has to drop. Getting these details wrong is exactly what makes a voice sound unnatural to the user.
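
The pitch behaviour Vazquez describes can be sketched as a deliberately naive rule, assuming the only cues available are the punctuation and a comma-separated list before "or"; a real engine would predict a full pitch contour from far richer features.

```python
# Toy rule of thumb for the final pitch movement of a sentence.
# Real TTS front ends model the whole intonation contour, not one label.

def final_pitch_movement(sentence):
    s = sentence.strip().lower()
    if not s.endswith("?"):
        return "fall"   # plain statements usually end with a fall
    if "," in s and " or " in s:
        return "fall"   # list question: the last option drops
    return "rise"       # yes/no question: the end rises

for s in ["Are you going to San Francisco or New York?",
          "San Francisco, Philly, or New York?",
          "Are you going to New York?"]:
    print(final_pitch_movement(s), "-", s)
```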

You shouldn't feel as though you're talking to a machine. Ideally, you shouldn't have to think about it at all.

"My kids interacted with Siri as if it were a living object" - Ward said - "They asked him to find everything for himself. They didn't notice the difference."

The effort to create artificial voices dates back to the 18th century, when scientists experimented with machines that could produce vowel sounds. A landmark was the vocoder, developed at Bell Labs starting in 1928, which during World War II was used to transmit speech to allies as an encrypted signal. The vocoder helped inspire HAL 9000, the talking computer in Arthur C. Clarke's 2001: A Space Odyssey, and decades later its sound became a fixture of pop music through acts like Kraftwerk.


The 1970s and 80s brought a wave of speech-synthesis milestones: the Speak & Spell toy, Knight Rider-esque talking cars that warned when fuel was running low, and the artificial voice created for physicist Stephen Hawking.

The difference between those voices and today's, however, is huge. The artificial voices of that era sounded robotic because they were entirely robotic. Through the 90s, computers simply weren't powerful enough to synthesize speech by recording human voices, chopping them into small pieces, sorting them, and reassembling them. Instead, you made a computer talk by programming a series of acoustic parameters, much like a synthesizer.

"Those machines are really simple when compared to the complexity of human voices," says Adam Wayment, deputy chief engineer of Cepstral, a company specializing in text-to-speech conversion. has created over 50 different voices. "The sound comes from the vocal cords, goes through the palate, enters the oral cavity, bounces around the tongue . all are soft tissue." So even if the synthesizer can produce clear sounds, it can't be human. Even a child is not so naive that he thinks he can talk to his Speak and Spell toy.

By the late 2000s, computers had finally become fast enough to search huge databases of recorded speech and combine the pieces into new words on the fly, letting companies start producing much more natural-sounding voices. At the same time, artificial intelligence matured enough to help computers make sophisticated decisions about language. When you see the word "wind", do you pronounce it as in "the wind is blowing" or as in "wind the thread around the spool"? An adult recognizes the difference instantly from context. A computer has to be taught that context.
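
Here is a minimal illustration of that kind of taught context for the homograph "wind". The cue-word lists are invented for the example; production systems rely on part-of-speech tagging and statistical models rather than hand-written rules.

```python
# Sketch only: choose a pronunciation for "wind" from crude context cues.
NOUN_CUES = {"blowing", "cold", "gust", "north", "strong"}   # /wInd/, rhymes with "pinned"
VERB_CUES = {"thread", "spool", "clock", "up", "around"}     # /waInd/, rhymes with "find"

def pronounce_wind(sentence):
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    if words & VERB_CUES and not words & NOUN_CUES:
        return "/waInd/"
    return "/wInd/"

print(pronounce_wind("The wind is blowing hard tonight."))
print(pronounce_wind("Wind the thread around the spool."))
```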

Text-to-speech software has been promised since the early days of personal computing; Apple even shipped such a program with the first Mac. But it was the explosive growth of mobile technology and the Internet that really drove up demand for electronic voices.


You can get a sense of how important text-to-speech has become by watching where the industry's giants are putting their effort. In a letter to shareholders last November, Microsoft CEO Steve Ballmer emphasized the importance of "natural language interpretation and machine learning", in other words, artificial intelligence applied to words.

While the tech industry is optimistic about the future of electronic voices, one group is surprisingly unenthusiastic about this scenario: the voice actors themselves, the very people who supply the industry's raw material. Part of the reason is simply that few of them understand what the technology actually does. A handful, such as September Day or Allison Dufty, a former Nuance voice, are willing to talk openly about their work, but they are rare. Non-disclosure agreements prevent most actors from associating their names with a given brand, and agents with ties to technology companies guard information to protect their competitive advantage. In the absence of information, paranoia reigns.

"In our industry, we consider technology to convert text to speech (TTS) as a threat," says marketing director of Voices.com, Stephanie Ciccarelli - "They think it will replace the voice actors ".

An email to one very successful voice actor, whose clients include Wells Fargo, NPR, and AT&T, drew a polite but firm reply: "The only thing I can say about how voice actors feel about TTS is that they find it pretty obnoxious. Maybe someday it will get to the level of today's 3D movies, but right now it's kind of a joke."

Back at Nuance, Ward and Vazquez are excited about the new technology they are testing. Ward explains that Nuance can now blend purely synthesized sounds with concatenated recordings and make the result sound natural, and that soon fully synthesized voices will be easy on the ear. Computing power has reached the point where it can produce a voice with almost no robotic edge at all.


"But now everything still depends on the voice of the real person" - Ward said. Even a synthesized voice also needs a pattern to imitate.

One day, Ward and Vazquez demonstrated an experiment that paired an RSS reader with a smart music player: the program could tell whether the news it was reading was happy or sad and pick a fitting piece of background music, giving the listener a stronger sense of mood. They toy with the idea that someday we might hear an article read in its author's own voice, or walk in the front door and say "I'm home" to have the thermostat adjust itself, all through speech recognition and artificial intelligence.
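
A rough sketch of what such an experiment could look like, under the assumption that a simple word-list sentiment score is enough for a demo; the word lists and track names are placeholders, not anything Nuance has described shipping.

```python
# Toy mood detector for headlines, used to pick a background track.
HAPPY = {"wins", "record", "growth", "rescue", "celebrates"}
SAD   = {"crash", "loss", "dies", "layoffs", "disaster"}

TRACKS = {"happy": "upbeat_loop.mp3",
          "sad": "somber_piano.mp3",
          "neutral": "ambient_pad.mp3"}

def pick_background(headline):
    words = set(headline.lower().split())
    score = len(words & HAPPY) - len(words & SAD)
    mood = "happy" if score > 0 else "sad" if score < 0 else "neutral"
    return TRACKS[mood]

print(pick_background("Local team wins championship, city celebrates"))
print(pick_background("Markets slide after surprise layoffs"))
```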

"That's right. You can use speech recognition software to let the machine spew out a chemical, and your wife won't know you just smoked," Ward said.

Jokes aside, that scenario is not far off, especially given smart-home devices like the Nest thermostat, which learns the temperatures you prefer and adjusts the house as you leave and return.

But even good voices don't please everyone (many people genuinely hate hearing their own voice). Still, the promise of better synthesized voices holds out hope for better communication with machines.

"It's easy to understand Siri, but what we still need is that this software can convey emotions and personalities like people," said Benjamin Munson, a professor of language arts at the University of Minnesota. At a minimum, he said, it would be great if software like Siri understands the user's mood and responds appropriately, such as how a person can gently respond to a client's angry attitude. . According to Munson, the so-called "nonverbal language" that we use is difficult to put into technology, but he stressed that many researchers have begun to learn about this issue.


"When I started entering this industry, the majority of the synthesized voice market was for automated voicemail answering systems, and the idea of ​​creating a voice could convey emotions and personality. how not to care " - according to Matthew Aylett, CereProc's director - " After all, you don't want the bank to announce your account balance in a dull voice while you're not Having a lot of money".

Now that synthesized voices read blogs and even entire Kindle books, announce schedules, and deliver reminders throughout the workday, it may be time for that to change.

"The R2D2 robot in Star War is always my favorite robot" - Aylett said - "It still speaks like a robot, but shows its personality, emotions and ridicule. We are also trying to to do so ".

Adam Wayment recalls talking with a visually impaired customer who asked him: "Do you know how hard it is to use a microwave when every model puts its buttons in a different place?" The conversation left Wayment imagining a world of talking microwaves. But he adds, seriously: "I think that day is coming, when every small device can speak. The risk is that we fill our lives with noise. It isn't enough for devices to talk; they have to tell us what we need and want to know."

And if things don't turn out that way, there's always another promising business: selling the sound of silence.