Microsoft's AI is getting crazily good at speech recognition

Angelica Greene
August 22, 2017

Microsoft's speech recognition capabilities are based on neural networks and other artificial intelligence (AI) technologies.

Microsoft Corporation (NASDAQ:MSFT) announced today that its conversational speech recognition engine - which converts human speech into text - has achieved a new record for accuracy.

The company's researchers have published a new paper describing the advances they have made since last year, when they said their systems had achieved "human parity" with a 5.9% error rate in transcribing audio conversations.

Using the Switchboard corpus, speech recognition systems are tasked with transcribing conversations about topics such as politics and sports.

The system has now surpassed that mark, reducing its word error rate by about 12 percent to 5.1%, using AI techniques such as neural-net-based acoustic and language models. Professional transcribers had hit a 5.9% word error rate, matching the results of Microsoft's earlier system. The researchers also combined predictions from multiple acoustic models.
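Word error rate, the metric behind the 5.9% and the new record, is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the system's output, divided by the number of reference words. A minimal illustrative sketch (this is not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six: "a" instead of "the".
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A 5.9% word error rate thus means roughly one word in seventeen is transcribed incorrectly.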


The company goes on to explain that the system can now tune itself to the language of an individual's previous conversations, better predicting what that person is likely to say next based on topic and context - something its current public bots are less adept at. The researchers also benefited from using the Microsoft Cognitive Toolkit 2.1 and Azure GPUs, which allowed them to explore model architectures, optimize hyper-parameters, and improve the speed at which models could be tested.
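One common way to bias a recognizer toward a speaker's previous conversations is to interpolate a general language model with a small model built from that speaker's recent text. The sketch below is a hypothetical illustration of that idea using unigram models and a made-up mixing weight; it is not Microsoft's implementation:

```python
from collections import Counter

def unigram_lm(text: str) -> dict:
    """Build a simple unigram probability distribution from text."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def interpolate(general: dict, session: dict, lam: float = 0.3) -> dict:
    """Mix a session-specific LM into a general LM with weight lam."""
    vocab = set(general) | set(session)
    return {w: lam * session.get(w, 0.0) + (1 - lam) * general.get(w, 0.0)
            for w in vocab}

general = unigram_lm("the game was close the team played well")
session = unigram_lm("the eclipse was visible the eclipse amazed everyone")
mixed = interpolate(general, session)
# "eclipse" now has nonzero probability even though the general model never saw it,
# so the recognizer is likelier to transcribe it correctly in this conversation.
```

Production systems use far richer models than unigrams, but the principle is the same: recent context shifts probability mass toward the words a particular speaker tends to use.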

"Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years", Xuedong Huang wrote.

Despite the record-low word error rate, Microsoft noted in the post that there are still many challenges to address in speech recognition. Researchers still need to build systems that can account for accents and styles of speech, while also teaching machines to understand the meaning of the words they are transcribing.

Huang states, "Moving from recognising to understanding speech is the next major frontier for speech technology". But now the speech recognition tech can look at context for clues.

Microsoft used the Switchboard library of conversations to train its speech recognition system and achieve human parity.
