By Mara Campbell
Let me start with a straight answer to the titular question so that you can continue reading with a light heart: No! Now that we can all breathe easily, I will elaborate.
You might be familiar with the concept of Speech Recognition Software (SRS) if you use Siri, Alexa, or Google Now. The software “listens” to you, “understands” you, and takes action. As translators, many of us have used Dragon NaturallySpeaking or the speech recognition apps that come with Windows and Apple operating systems, which listen to and transcribe what we dictate. These technologies also produce live closed captions through a technique called “respeaking.” Nowadays, automatic speech recognition (ASR) software transcribes audio or video files automatically, and some tools even time or spot the transcription to produce closed captions (mainly live, but off-line, too) or subtitles for a show. You might be familiar with Google’s ASR tool, which produces closed captions for many YouTube videos. These SRS and ASR tools fall under the umbrella of Artificial Intelligence.
In 2016, I attended the NAB (National Association of Broadcasters) Show in Las Vegas and the IBC (International Broadcasting Convention) show in Amsterdam, both broadcast technology trade fairs. At those events, I met with several companies that develop ASR solutions and decided to try their tools to find one we could use in our company to cut costs and turnaround times. My company only provides off-line subtitling and captioning services, so I wanted to see how the various software options, developed mostly with live captioning in mind, would respond in this environment.
These companies work mainly on applications of speech recognition technologies that already exist. For example, Amazon developed the software for its Alexa system, which it has already “taught” to recognize speech and loaded with a huge knowledge base and vocabulary. Companies like the ones I tested can use that technology to inform their own software development. What mostly sets them apart is how they apply and adapt these existing technologies, and which user interface they provide.
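To give a sense of what these underlying engines look like from a developer’s point of view, here is a minimal sketch in Python using the open-source SpeechRecognition library to send a clip to Google’s free Web Speech endpoint. This is purely illustrative: the file name is hypothetical, and this is not one of the tools I tested, just the kind of raw service such companies build their products on.

# pip install SpeechRecognition
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load a clip that has been extracted to WAV (hypothetical file name)
with sr.AudioFile("brooklyn_99_clip.wav") as source:
    audio = recognizer.record(source)

# Send the audio to Google's Web Speech API; language is a BCP-47 code
try:
    print(recognizer.recognize_google(audio, language="en-US"))
except sr.UnknownValueError:
    print("The engine could not make sense of the audio")
except sr.RequestError as e:
    print(f"Could not reach the service: {e}")

Note that what comes back is a plain, unpunctuated string of words: all the segmenting, punctuating, and timing still has to happen on top of it.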
I tested three tools, two English and one Dutch. I am also aware of a Swiss tool that I have not yet used; hopefully, in the second part of this article I will be able to present an analysis of that software, too. I prepared 10-minute clips of shows and movies posing different challenges to see how the tools responded and compared. Since most of them offered multilingual resources, I set up a small sampling of the typical shows and films we work with.
In English: Brooklyn Nine-Nine, a fast-paced police sitcom; Drugs in Hollywood, a documentary series about drugs, featuring a scripted narrator and unscripted dialog delivered by drunk and/or drugged people; Uncle Grandpa, an animated show for children with very fast dialog and multiple characters with cartoonish voices.
In Latin American Spanish (note that LAS encompasses a multitude of accents and variants, with differing vocabulary, grammatical constructions, etc.): El Show de Jaime Maussan, a Mexican journalistic show about UFOs and aliens, mostly scripted but with some interviews; El Puntero, an Argentine drama about a political broker who mediates between impoverished communities and politicians; El Señor de los Cielos, a Mexican show about drug lords and trafficking; Soy Luna, an Argentine teenage telenovela with many characters/actors from all over Latin America; Odisea Argentina, an unscripted Argentine political program with three presenters who debate current events and politics.
In Brazilian Portuguese: O Caminho das Nuvens, a dramatic film with many adult and child characters.
I encountered some difficulties when testing the software, so I could not test every clip on every ASR. The Dutch tool does not offer Portuguese as a language. One of the English companies does not have a user interface yet, so I could only test one clip with them so far (via email), but they ran it through four different speech recognition services (Amazon, IBM, Google, and their proprietary one) and provided an accuracy percentage for each. The second English tool supports all the languages I needed and has a very good user interface, but offered only a limited trial, so I could not process every clip.
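The companies did not explain how their accuracy percentages are calculated, but the standard metric in the field is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the ASR output into a reference transcript, divided by the number of words in the reference. Accuracy is then, roughly, one minus WER. The sketch below is my own illustration of the arithmetic, not any vendor’s actual scoring code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# An "overall accuracy of 91%" corresponds roughly to a WER of 0.09,
# i.e., about one wrong word in eleven.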
Next is the question of whether the tools provide a text transcription only or a timed subtitling file. Both English companies only provide a text transcription, which is useful only to some extent or for a small part of the job. The Dutch company generates a very well timed transcription and has a commendable grasp of line and subtitle breaks. The Dutch tool and one of the English tools showed sound management of punctuation, although they definitely do not punctuate like real linguists… but who does, right?
To date, I have the Brooklyn Nine-Nine clip processed by every tool, which allows for a good comparison, plus at least one version of each of the other clips. For those I cannot compare results to determine which tool is more accurate, but I can definitely tell whether they are useful for a typical audiovisual translator’s workflow.
So far, my conclusion is that these tools are years away from becoming useful for audiovisual translators and AVT companies. First of all, the files produced by some of the ASR tools need to be segmented and/or timed, which can be quite time-consuming. The Dutch software is definitely the most useful tool, because its user interface is very good and easy to use, and it even has a built-in editing tool for timing and text. But the actual transcription requires heavy editing.
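To give an idea of what that segmenting entails: an untimed transcript has to be chopped into subtitle-sized blocks, typically no more than two lines of roughly 42 characters each, before timing can even begin. Here is a deliberately naive sketch of that first step (the 42-character limit is a common broadcast convention; actual limits vary by client).

import textwrap

MAX_LINE = 42      # characters per line (a common broadcast limit)
LINES_PER_SUB = 2  # at most two lines per subtitle

def segment(transcript: str) -> list[str]:
    """Greedily pack a plain transcript into two-line subtitle blocks."""
    lines = textwrap.wrap(transcript, width=MAX_LINE)
    return ["\n".join(lines[i:i + LINES_PER_SUB])
            for i in range(0, len(lines), LINES_PER_SUB)]

Even this toy version shows the problem: greedy wrapping ignores sentence boundaries, speaker changes, and shot changes, all of which a subtitler takes into account, so anything a tool gets wrong here has to be fixed by hand.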
Here are three examples from the Brooklyn Nine-Nine show for comparison. For the English company (“English company 1”) that provided four different transcriptions from different services, I will use the one generated by Amazon, which had an overall accuracy score of 91%.
EXAMPLE 1 (Quotes added by me; line breaks, capital letters, and punctuation — or lack thereof — are original):
English company 1:
“it’s my craft anyways grandson’s coming in they reunite and I throw another case on the old solved it hey micros aren’t you want to see me captain”
Dutch company:
17
00:00:53,120 --> 00:00:59,760
It’s my craft anyways grandsons
coming in they reunite and I threw
another case on the old salted pile.
18
00:01:03,640 --> 00:01:11,840
Hey Microsoft. You wanted to see me
captain.
(Note that the second subtitle actually contains two different speakers; in real-life subtitling, we would need to adjust the line break and add a dialog dash, or split the subtitle in two.)
English company 2:
“It’s my craft. Anyways grandson’s coming in. They reunite and I throw another case on the old solved pile. Microsoft. Went to see me Captain.” (Note: this tool identifies different speakers relatively accurately, even marking them as a female or male voice, and, in this case, it did recognize a change of speaker in “Went to see me Captain.”)
The correct human-made subtitling:
31
00:00:53,178 --> 00:00:54,847
It’s my craft.
32
00:00:54,847 --> 00:00:56,598
Anyways,
grandson’s coming in.
33
00:00:56,640 --> 00:00:58,100
They reunite,
and I throw another case
34
00:00:58,100 --> 00:00:59,810
on the old “solved it” pile.
35
00:01:03,272 --> 00:01:04,606
Hey, my croissant.
36
00:01:10,153 --> 00:01:11,613
You wanted to see me,
Captain?
EXAMPLE 2 (Quotes added by me; line breaks, capital letters, and punctuation—or lack thereof—are original):
English company 1:
“absolutely sir I won’t just head it up I will head and shoulders it up I will dive into women rounded just be just be altogether good with it be more articulate when you speak to the children” (Note: “Be more articulate…” is spoken by another character.)
Dutch company:
22
00:01:25,040 --> 00:01:35,760
Absolutely, sir, I won’t just headed
up I will head and shoulders it up I
will die women around it just be
altogether good with it
23
00:01:35,960 --> 00:01:38,120
, the more articulate when you speak
to the children.
English company 2:
“Absolutely sir. I won’t just headed up I will head and shoulders it up. I will dive in swim around it. Just be all together good with it.
Be more articulate when you speak to the children.” (The software identified the change in speaker and separated the text onto a new line.)
The correct human-made subtitling:
45
00:01:24,168 --> 00:01:26,670
Absolutely, sir.
46
00:01:26,670 --> 00:01:28,046
I won’t just head it up,
47
00:01:28,046 --> 00:01:29,798
I will
head and shoulders it up.
48
00:01:29,798 --> 00:01:32,050
I will dive in,
swim around it,
49
00:01:32,050 --> 00:01:35,762
and just be altogether
good with it.
50
00:01:35,762 --> 00:01:38,390
Be more articulate
when you speak to the children.
EXAMPLE 3
English company 1:
“uh mystery missus torino I’m glad you’re here Man present to you Oh my darlings Thank god Found you uh look at those beautiful cheeks” (Note that sentences starting with “Oh, my darlings” are spoken by a second speaker.)
Dutch company:
47
00:02:50,640 --> 00:02:55,440
Mr-Mrs to Reno glad you’re here
man present to you properly.
48
00:02:55,640 --> 00:03:01,680
My darlings then God I found you oh
look at those who in full she.
English company 2:
“This you’re Mrs. Torino. I’m glad you’re here. May I present to you.
My darlings. Thank God I found you. Oh look at those beautiful cheeks.” (The software identified the speaker change and separated the text onto a new line.)
The correct human-made subtitling:
89
00:02:50,128 --> 00:02:51,755
Ah, Mr. and Mrs. Terrino.
90
00:02:51,755 --> 00:02:53,173
I’m glad you’re here.
91
00:02:53,173 --> 00:02:54,842
May I present to you…
92
00:02:54,842 --> 00:02:58,637
Oh, my darlings.
Thank God I found you.
93
00:02:58,637 --> 00:03:01,598
Oh, look at
those beautiful cheeks.
English company 2 consistently presents more accurate transcriptions, with better punctuation and capitalization. But the Dutch company, although not quite as accurate in its transcription, creates timed subtitles (if you compare its timecodes with the final subtitles, you can see that they are very well timed), which might compensate for some of the misrecognitions.
Regarding the other English clips I had transcribed, the results were not very good. The file for Drugs in Hollywood was not useful at all, but I actually expected that, since it is a non-scripted show full of slurred dialog and slang; even a human transcriber would have trouble tackling it. The one for Uncle Grandpa, although better, still presented errors probably caused by the cartoonish voices and the outlandish topics covered, including strange names and invented words, which gave the software serious problems, too. Clearly, this technology is not yet mature enough to be used on all kinds of material; it probably performs acceptably only on scripted material and single-speaker content such as newscasts, documentaries, lectures, and sermons.
As for the tests I performed on the Spanish clips, the results were very poor. The different accents and variants were too much for the ASR tools to handle, and the transcriptions were utterly unintelligible; editing them would have been more time-consuming than transcribing from scratch. The Portuguese results were also quite bad, although better than the Spanish.
I knew quite well that these tools would provide minimal assistance and that extensive post-editing would be needed before the resulting ASR files could be used as translation templates or delivered to clients. My next step will be to edit the English transcriptions into final deliverables and compare the time and effort each takes. I will share my findings in part 2 of this article, coming soon in the next issue of Deep Focus.