For the past 25 years, it’s been a major research goal in the field of computer science to build an artificial intelligence (AI) that’s capable of accurate language recognition.
It’s clear that huge strides have been made, with developments in machine learning and natural language processing allowing technology companies to build impressive AIs like Apple’s Siri, Microsoft’s Cortana and Amazon’s Alexa. These digital assistants can now accomplish some pretty impressive feats – whether you’re lost in a strange city, craving a pizza or just want to blast out some music, they’ve got your back. Just ask a question and they’ll listen, understand, and offer a solution to your query.
However, one goal in machine learning remains out of reach: an artificial intelligence that can produce long-form transcriptions of human speech with the same level of accuracy as a human listener. Despite wider advances in the technology available, the complexity of human language makes this incredibly difficult to achieve.
In this article, we’ll investigate what makes this such a challenge and offer an answer to the following question:
Why can’t computers transcribe speech as well as humans?
WHAT IS MACHINE LEARNING?
The ultimate goal of artificial intelligence research is to create systems that understand, think, learn and behave like humans – a goal which was most famously proposed by pioneering computer scientist Alan Turing. He devised the Turing Test, a highly influential benchmark experiment that measures a machine’s ability to display intelligent behaviour that’s indistinguishable from that of a human. This test is based on conversation, with an evaluator tasked with watching a text conversation between a machine and a human, and seeing if they can tell which is which. If the evaluator can’t tell the difference, the machine has passed the test.
Although some programmes, including chatbots, have technically passed the test, these results have been subject to some criticism, and others have claimed that we’re still a decade away from truly achieving this.
Machine learning is an AI principle that scientists believe will enable them to build this truly intelligent AI. Put simply, they strive to design computer systems that have the ability to learn independently and without human assistance. These AIs can look at data, analyse it, and use it. They achieve this by first making observations to find patterns, which can then be used to draw conclusions and make decisions in the future. For instance, you could show a computer photos of different food, with some labelled “this is a pizza” and others “this is not a pizza”. Using the learnings from this first dataset, the computer would soon be able to recognise pizza autonomously, becoming more and more accomplished at this task over time.
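To make the pizza example a little more concrete, here’s a minimal sketch in Python of learning from labelled examples. It’s a toy, not how any real image recogniser works: the two features (`roundness` and `cheese_coverage`), the example values and the function names `train` and `predict` are all made up for illustration, and the “learning” is just averaging the examples for each label.

```python
# Hypothetical, hand-made features for each photo: (roundness, cheese_coverage),
# both on a 0-1 scale. A real system would work from the raw pixels instead.
LABELLED_PHOTOS = [
    ((0.90, 0.80), "pizza"),
    ((0.80, 0.90), "pizza"),
    ((0.95, 0.70), "pizza"),
    ((0.20, 0.10), "not pizza"),
    ((0.90, 0.05), "not pizza"),  # e.g. a round plate of salad
    ((0.30, 0.60), "not pizza"),  # e.g. a cheesy but square lasagne
]


def train(examples):
    """Learn one average feature vector (a centroid) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new photo."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(LABELLED_PHOTOS)
print(predict(model, (0.85, 0.75)))  # a round, cheesy photo it has never seen
print(predict(model, (0.25, 0.20)))  # neither round nor cheesy
```

The key point is that nobody writes a rule saying “pizza is round and cheesy”. The program works that out from the labelled examples, and adding more examples is what lets it improve over time.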
Admittedly, recognising pizza might not sound like the most useful or exciting achievement in the world, but this is just a basic example. In reality, machine learning is already everywhere and is used in incredibly advanced applications. It has put autonomous robots on Mars, is used by traders to predict the fluctuations of the stock market, and has been employed by pharmacologists to discover groundbreaking new drugs.
On a smaller scale, it’s used to predict what you’ll buy on Amazon, display relevant ads on Facebook and produce the right search results when you Google something. Plus, an AI can probably beat you at chess; in 2015 a deep learning machine called ‘Giraffe’ reached international master level after just 72 hours of practice!
So if an AI can explore another planet, why can’t it transcribe speech?
THE CHALLENGE OF HUMAN LANGUAGE FOR AI
There’s a discipline in computer science called Natural Language Processing, which focuses on using machine learning and artificial intelligence to create a computer that can understand and respond to human language. For this, computers need to develop models of human language that reflect the true nuances of human speech. As we mentioned earlier, digital assistants like Siri represent significant progress in this field, as they can accurately listen to and respond to requests. Improvement has been rapid too – Google has slashed its speech recognition error rate by more than 30% since it first introduced the technology in 2012.
However, serious obstacles still remain in Natural Language Processing. Although computers can use machine learning to study patterns in language, language is ultimately variable, idiomatic and very contextual. This is particularly the case with speech. We don’t speak in the same way that we write; for one, written language has to be taught, rather than being naturally acquired like speech is. Written language has patterns and rules that we can teach to a computer so that it’s able to read, understand, and process it.
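When people talk about a speech recogniser’s “error rate”, they usually mean word error rate: the number of words that would have to be substituted, deleted or inserted to turn the machine’s transcript into the correct one, divided by the number of words in the correct one. Here’s a small sketch of that calculation in Python. The function name `word_error_rate` and the example sentences are our own, and we’re not claiming this is the exact measure behind any particular company’s figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the length of the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the fewest edits needed to turn the reference words
    # processed so far into the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


correct = "please order a large pizza for me"
machine = "please order large pizzas for me"
print(f"{word_error_rate(correct, machine):.1%}")
```

In this example the machine dropped “a” and wrote “pizzas” instead of “pizza”, so two of the seven reference words are wrong. Note that the measure counts every word equally, so a harmless slip and a mistake that changes the meaning cost the same.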
This is why text mining – where computers use machine learning techniques to analyse textual content – is both possible and an effective way to reveal patterns and relationships in large amounts of text. After all, a computer can read much faster than a person. (Incidentally, the Turing Test judges computers based on their ability to respond to a text-based instant message style conversation, not a spoken one.)
On the other hand, spoken language is full of irregularities like pauses, mispronunciations and non-standard grammar. This makes it tricky for an artificial intelligence to classify it and understand its patterns.
On top of that, to analyse speech, a computer first has to recognise and record the speech as text. This means understanding the way that different sounds relate to fixed written words, but as people talk differently, it’s hard to get consistent results. For instance, someone saying “hello” quickly will produce a much smaller sound file than someone who says it more slowly. It’s actually very difficult for an artificial intelligence to then realise that these two drastically different sound files represent the same fixed-length word and have an identical meaning.
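One classic way of comparing two recordings of different lengths is dynamic time warping, which stretches or squeezes one sequence so that it lines up with the other before measuring how different they are. The sketch below shows the idea in Python on made-up numbers standing in for audio features; real recordings would be far longer and messier, and we’re not suggesting this is how any of the assistants mentioned in this article work.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] is the cheapest way to align the first i items of a
    # with the first j items of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a moves on while b stays on the same item
                cost[i][j - 1],      # b moves on while a stays on the same item
                cost[i - 1][j - 1],  # both move on together
            )
    return cost[len(a)][len(b)]


# Hypothetical feature values, invented purely for illustration.
hello_fast = [1, 3, 4, 2]              # "hello" said quickly
hello_slow = [1, 1, 3, 3, 4, 4, 2, 2]  # the same word, twice as long
other_word = [4, 1, 2, 3]              # a different word

print(dtw_distance(hello_fast, hello_slow))
print(dtw_distance(other_word, hello_slow))
```

The fast and slow versions of “hello” line up perfectly once the fast one is stretched, so their distance is zero even though one sequence is twice as long as the other, while the different word can’t be made to fit however it’s stretched. Real speech is nowhere near this tidy, which is where the difficulty comes in.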
THE LIMITATIONS OF AUTOMATED TRANSCRIPTION
Naturally, a goal of artificial intelligence researchers has been to create a computer that can transcribe human speech with the same accuracy as human transcribers. However, this goal is easier said than done for the reasons stated above. As computers struggle to recognise and record natural speech, it’s difficult for them to produce an accurate transcript that captures the meaning of what was said.
Some advances have been made, but commercially available technology is currently quite limited in application and often requires a person to go through and correct the errors in the final transcript. Here are just some of the limitations that are preventing AI transcription services from squeezing humans out of the equation:
1. ARTIFICIAL INTELLIGENCE STRUGGLES WITH MULTIPLE SPEAKERS
Current tech is far more successful when transcribing the speech of a single person. When there are multiple people, AIs can’t account for all the additional ambiguity that arises during a conversation.
Think of all the chaos in a normal conversation: people interrupt each other, some people speak more loudly than others, everyone starts speaking at the same time… All of these factors get in the way of an AI’s ability to transcribe, as it will struggle to separate all this overlapping speech. Plus, you really need the microphone to be very close to the speaker to get a clear recording, which is often difficult to achieve in a group setting.
This means that this sort of transcription is only useful if you have a single speaker. If you want to record anything where more than one person is speaking – whether it’s a focus group, a triad interview or a podcast – the computer won’t be able to figure out who’s who. This of course makes a transcript pretty useless. Meanwhile, a human transcriber will pick this information up automatically.
2. COMPUTERS STRUGGLE WITH BACKGROUND NOISE
For AIs to transcribe speech, the recording needs to be good quality and very clean. Background noise, music and traffic will all reduce the accuracy of a transcript.
We can’t always achieve perfect recording conditions, especially when outdoors or in a crowded area, so this is a huge limiting factor for the spread of AI-powered transcription services. Although a clean recording is nice for a human transcriber, it’s not essential – we can work around it!
Some have argued that readily available AI transcription would make transcribing everything the norm, but until these services can be used in a variety of different settings this seems unlikely.
3. COMPUTERS STRUGGLE WITH ACCENTS
Much of the research into Natural Language Processing has focused primarily on American English and British English, so transcribing AIs are often only trained on those accents. This means that a different accent can really throw them for a loop. It’s clear that this is a real problem for voice recognition generally, with people complaining that Alexa and Siri can’t understand their accents.
On the other hand, humans are exposed to different accents on a daily basis. This is especially the case for experienced transcribers, who will be used to listening to and accurately transcribing accented speech.
WILL TRADITIONAL TRANSCRIPTION SERVICES BE REPLACED BY AI?
Overall it seems that, despite decades of research, humans are still far more effective transcribers than their machine counterparts. You can ask a person to transcribe anything and they’ll at least give it a go, whilst AIs are easily brought to a halt by the natural nuances and variations present in speech.
Even more advanced technology produces transcripts that need to be checked over by a person, so it’s not a fully automated process – you’d still need to spend time reading through the transcript and listening to the original recording to fix any mistakes. Just how much time would an AI save then?
For the time being at least, AIs certainly aren’t capable of the same results as human transcription services, and they simply don’t represent a universal solution due to their limitations. Ask a person to transcribe accented speech and it might take them a bit longer than usual. Ask a computer the same thing and you’ll probably end up with total nonsense!