Microsoft's AI is getting crazily good at speech recognition

Angelica Greene
August 22, 2017

Microsoft's speech recognition capabilities are based on neural networks and other artificial intelligence (AI) technologies.

Microsoft Corporation (NASDAQ:MSFT) announced today that its conversational speech recognition engine - which converts human speech into text - has achieved a new record for accuracy.

The company's researchers have produced a new paper describing the advances they have made since last year, when they said their systems had achieved "human parity" with a 5.9% error rate in transcribing audio conversations.

On the Switchboard benchmark, speech recognition systems are tasked with transcribing conversations about topics such as politics or sports.

Microsoft has now surpassed that mark, reducing its error rate by about 12% to 5.1%, using AI techniques such as neural-net-based acoustic and language models. The researchers also combined the predictions of multiple acoustic models. Professional transcribers hit a 5.9% word error rate, matching the results of Microsoft's earlier system.
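Combining the predictions of multiple acoustic models is a form of ensembling. As a rough illustration (not Microsoft's implementation), each model can be treated as producing a probability distribution over phone labels for an audio frame, with the ensemble averaging those distributions before choosing the most likely label:

```python
# Hypothetical sketch of combining predictions from multiple acoustic
# models: average the per-model probability distributions for a frame,
# then pick the most likely phone label. All names are illustrative.

def combine_acoustic_models(frame_predictions):
    """Average probability distributions (dicts of phone -> prob),
    one per acoustic model, for a single audio frame."""
    combined = {}
    for dist in frame_predictions:
        for phone, prob in dist.items():
            combined[phone] = combined.get(phone, 0.0) + prob
    n = len(frame_predictions)
    return {phone: total / n for phone, total in combined.items()}

# Two models disagree on a frame; the ensemble resolves the conflict.
model_a = {"ah": 0.6, "eh": 0.4}
model_b = {"ah": 0.3, "eh": 0.7}
combined = combine_acoustic_models([model_a, model_b])
best = max(combined, key=combined.get)
# "eh" wins: (0.4 + 0.7) / 2 = 0.55 vs. (0.6 + 0.3) / 2 = 0.45
```

Averaging smooths out the idiosyncratic mistakes of any single model, which is one reason ensembles tend to edge out their best individual member.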

The company goes on to explain that the system can now tune its language model to an individual's previous conversations, using that topic and context to better predict what a speaker will say next - something its current public bots do less well. The researchers also credit the Microsoft Cognitive Toolkit 2.1 and Azure GPUs, which allowed them to explore model architectures, optimize hyperparameters, and improve the speed at which models could be tested.
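To see how a language model can be tuned to a speaker's earlier conversation, consider a deliberately tiny sketch (not Microsoft's system, which uses neural language models): a bigram model whose counts are updated from the dialogue history, so its next-word prediction drifts toward the current topic.

```python
# Illustrative bigram language model adapted from conversation history.
# Real conversational systems use far richer neural models; this only
# demonstrates the idea of conditioning predictions on prior turns.
from collections import defaultdict

class BigramModel:
    def __init__(self):
        # counts[prev_word][next_word] = times next_word followed prev_word
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, sentence):
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.counts[prev][nxt] += 1

    def predict_next(self, word):
        followers = self.counts[word.lower()]
        if not followers:
            return None
        return max(followers, key=followers.get)

lm = BigramModel()
# Tune on earlier turns of a sports-themed conversation.
lm.update("the game was a great game")
lm.update("what a great save in the game")
lm.update("that was a great game tonight")
prediction = lm.predict_next("great")  # "game" (seen twice) beats "save"
```

The same mechanism scaled up - conditioning on what a particular speaker has already said - is what lets a recognizer favor topical words over acoustically similar alternatives.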

"Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years", Xuedong Huang wrote.

Despite the record-low word error rate (WER), Microsoft noted in the post that there are still many challenges to address in speech recognition. The researchers need to build systems that can account for accents and styles of speech, while also teaching the machines to understand the meaning of the words they're transcribing.
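Word error rate, the metric behind the figures above, is the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference, divided by the reference length. A standard dynamic-programming computation looks like this (a sketch, not Microsoft's evaluation code):

```python
# Word error rate (WER) via word-level Levenshtein edit distance.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the game was close", "a game was close")
# one substitution across four reference words -> 0.25
```

A 5.1% WER means roughly one word in twenty is transcribed incorrectly relative to what a human scorer considers correct.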

Huang states, "Moving from recognising to understanding speech is the next major frontier for speech technology". But now the speech recognition tech can look at context for clues.

Microsoft used the Switchboard library of conversations to train its speech recognition system and achieve human parity.
