Over the past five years, a small team in the Division of Conference Management (DCM) at the United Nations Office at Geneva has been working with the Advanced Technology Applications Center of the World Intellectual Property Organization (WIPO) to teach a deep recurrent neural network to recognize spoken words in six languages and to output them in the form of an automatic transcription. Meet your new note-taker, FAST, for Fully Automated Speech-to-Text, powered by WIPO AI. You can check what the youngest child born of our inter-institutional collaboration has learned to do so far by looking up any public meeting on the Digital Recordings Portal, where the transcripts are displayed alongside their respective audio recordings, by language channel. You will notice the resulting text is synced to the recording, essentially rendering it searchable. Looking up a keyword in the transcript will take you directly to the corresponding place in the audio to check against delivery.
While we may take understanding and being understood for granted, those are remarkably complex tasks. The fact that computers can emulate them at all and do so with such impressive results is nothing short of astonishing. Children and non-native speakers – basically anyone learning a language – would surely concur.
In fact, try to remember learning how to read, how to associate sounds with letters. Do you remember your first ABC book? Hearing your grown-up articulate /’si:/ /’eɪ/ /’ti:/, while you watch the tip of their finger take a trip of three steps to tap, at three, on the “t”… C-A-T. This is a /’kæt/, you hear them say. Cat.
Automatic Speech Recognition (ASR) systems, like the one behind FAST, and machine learning models in general are not all that different from children. They are great at learning through examples rather than instructions. The more examples, the better. FAST was trained on thousands of hours of utterances with matching text.
Still, similarly to children and non-native speakers, FAST might get flummoxed by acronyms, terms, and proper names that it has never heard before or seen written down. It sometimes struggles with punctuation, paragraph segmentation, and capital letters. Like all of us, it stumbles when the sound quality is poor or when speech is heavily accented, atypical, obstructed, overlapping or inaudible. When we are tuned to listen to an English speaker, an unexpected accented "je vous remercie" might throw us off, and it would take our brains a moment to recalibrate or to make up something English-sounding instead. Likewise, language detection is not trivial for an ASR. Somewhat counterintuitively, the further removed two languages are from each other phonetically, the more a switch between them stumps the machine. Whereas Romance languages might all lump together for a human ear, a computer can shift back and forth between them with relative ease. While gear shifts from French to Russian to Arabic to Chinese will be clearly discernible to us, current ASR models are likely to falter on them.
Mondegreens, or mishearings with hilarious outcomes, are also something we have in common with speech recognition models. ASR-generated captioning relies on a predictive text component. Guesswork based on context and word probability, or frequency of use, is no stranger to our brains. What were the chances that Jimi Hendrix's "Excuse me while I kiss…" would finish with A) "this guy" or B) "the sky"? In the end, we are all just trying to make sense of similar-sounding stretches of speech, otherwise known as oronyms, as best we can.
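The guesswork a predictive text component performs can be sketched with a toy model that picks whichever ending appears most often after the preceding words. All counts below are invented for illustration; a real ASR language model works over far larger vocabularies and contexts, but the principle is the same:

```python
# Toy sketch of a predictive-text component choosing between two
# similar-sounding endings. The phrase counts are invented for
# illustration, not drawn from any real corpus or language model.
counts = {
    ("kiss", "the sky"): 950,   # hypothetical frequency in training text
    ("kiss", "this guy"): 50,
}

def pick_ending(prev_word, candidates):
    """Return the candidate ending seen most often after prev_word."""
    return max(candidates, key=lambda c: counts.get((prev_word, c), 0))

print(pick_ending("kiss", ["the sky", "this guy"]))  # → the sky
```

With these made-up counts, the model resolves the ambiguous sound to "the sky" simply because that continuation is more frequent, which is exactly why rare but correct phrases (or proper names) sometimes lose out to common-sounding alternatives.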
Some things that take no additional cognitive effort on our part, like understanding words spoken in a female rather than a male voice, are more complex for speech recognition models. The often-cited reason is that women have tighter and thinner vocal cords, producing sounds of a higher pitch or frequency and resulting in different waveforms that are harder for an ASR system to process.
The real reason? Unlike humans, it just hasn’t been exposed to enough examples of those sound waves to learn from. This is why UNOG’s training data for FAST included plenty of female voices.
Some things are infinitely easier for a machine, though. No matter how fast we write, type, or render speech in shorthand, we will never be able to produce a workable draft transcript of a three-hour meeting in under fifteen minutes. Those three hours cannot be compressed by our brains, let alone transcribed at such speed. If you get tired or distracted and miss parts of what was said in a meeting, FAST has your back.
To give the human child-machine analogy one last stretch, let's take a look at how we measure what either one of them has learned. We don't crack children open and go poking in their brains to understand how well they have absorbed studied material. The black box of a machine learning model is also resistant to poking. Instead, we assess and score their performance. We have them take tests. One of the most important and challenging tasks for the FAST project team is to create such tests, or so-called "Gold Standard" evaluation sets, in each language. The demos shared by commercial ASR providers often resort to cheating. Transcribing the voice of a middle-aged male native speaker who enunciates clearly while slowly reading out a generic text in a room with no ambient noise will likely yield a near-100% accuracy score. Like a competitive language examination, the Gold Standard tests must be representative of the UN register and subject domains in all of their complexity. Furthermore, they must be true to a real-life conference environment with its hustle and bustle, speeds of delivery, surprise transitions between languages, speakers with speech difficulties, and more non-native, female, and ageing voices.
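The scoring itself is usually done with Word Error Rate (WER), the standard ASR metric: count the word substitutions, insertions, and deletions needed to turn the system's output into the Gold Standard reference, then divide by the reference length. A minimal sketch (this is the generic textbook metric, not the FAST team's actual scoring pipeline):

```python
# Minimal Word Error Rate (WER) sketch: edit distance over words between
# a reference ("Gold Standard") transcript and an ASR hypothesis,
# normalized by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first j hypothesis words
    # into the first i reference words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Hendrix's mondegreen scores two substitutions out of seven words:
print(wer("excuse me while i kiss the sky",
          "excuse me while i kiss this guy"))  # → 0.2857… (2 errors / 7 words)
```

A "near 100% accuracy" demo corresponds to a WER close to zero on easy audio; the whole point of a representative Gold Standard set is that the same system's WER is measured on realistic, difficult material instead.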
I am proud to report that we have taught FAST well. So far it has taken several tests, competing alongside industry heavyweights such as Google, Microsoft, and Amazon, and has outperformed them all in both English and French.
The English version of the FAST transcription tool has been used for over three years now to deliver automatic searchable transcripts to meeting attendees, secretariats, and staff. In 2022, the service was expanded to French and Spanish, and the meeting transcripts found a permanent home on the Digital Recordings Portal. Since September 2023, we have been testing Arabic, Chinese, and Russian versions of the tool. Please be sympathetic to its struggles and occasional mistakes: FAST has worked very hard to learn new languages.