{"id":719,"date":"2019-08-13T14:06:06","date_gmt":"2019-08-13T14:06:06","guid":{"rendered":"https:\/\/www.ata-divisions.org\/AVD\/?p=719"},"modified":"2019-08-13T18:29:38","modified_gmt":"2019-08-13T18:29:38","slug":"automatic-speech-recognition-softwares-will-they-replace-audiovisual-translators-in-the-near-future-part-1","status":"publish","type":"post","link":"https:\/\/www.ata-divisions.org\/AVD\/automatic-speech-recognition-softwares-will-they-replace-audiovisual-translators-in-the-near-future-part-1\/","title":{"rendered":"Automatic Speech Recognition Softwares: Will they replace audiovisual translators in the near future? (Part 1)"},"content":{"rendered":"<div class=\"entry-content\" style=\"text-align: center;\">\n<h4 class=\"has-medium-font-size\" style=\"text-align: left;\">By Mara Campbell<\/h4>\n<p style=\"text-align: left;\">Let me start with a straight answer to the titular question so that you can continue reading with a light heart: No! Now that we can all breathe easily, I will elaborate.<\/p>\n<p style=\"text-align: left;\">You might be familiar with the concept of Speech Recognition Software (SRS) if you use Siri, Alexa, or Google Now. The software \u201clistens\u201d to you, \u201cunderstands\u201d you, and takes action. As translators, many of us have used Dragon NaturallySpeaking or the Speech Recognition apps that come with Windows and Apple operating systems, which listen to and transcribe what we dictate. These technologies also produce live closed captions with a technique called \u201cRespeaking.\u201d Nowadays, automatic speech recognition (ASR) software transcribes audio or video files automatically, and some tools even time or spot the transcription to produce closed captions (mainly live, but off-line, too) or subtitles for a show. You might be familiar with Google\u2019s ASR tool, which produces the closed captions in many YouTube videos. 
These SRS and ASR tools fall under the umbrella of Artificial Intelligence.<\/p>\n<p style=\"text-align: left;\">In 2016, I attended the NAB (National Association of Broadcasters) Show in Las Vegas and the IBC (International Broadcasting Convention) show in Amsterdam, both broadcast technology trade fairs. At those events, I met with different companies that develop ASR solutions and decided to try their tools out to find one we could use in our company to cut costs and turnaround times. My company only provides off-line subtitling and captioning services, so I wanted to see how the various software options, developed mostly with live captioning in mind, would respond in this environment.<\/p>\n<p style=\"text-align: left;\">These companies work mainly on applications of speech recognition technologies that already exist. For example, Amazon developed software for its Alexa system, which it has already \u201ctaught\u201d to recognize speech and loaded with a huge knowledge base and vocabulary. Companies like the ones I tested can use that technology to inform their software development. What mostly sets companies apart is how they apply and adapt these existing technologies, and which user interface they provide.<\/p>\n<p style=\"text-align: left;\">I tested three tools: two English and one Dutch. I am also aware of a Swiss tool, which I have not yet used; hopefully, in the second part of this article I will be able to present an analysis of that software, too. I prepared 10-minute clips of shows and movies with different challenges to see how the tools responded and compared. 
Since most of these software tools offered multilingual resources, I set up a small sampling of the typical shows and films we work with.<\/p>\n<p style=\"text-align: left;\"><strong>In English<\/strong>: <em>Brooklyn Nine-Nine<\/em>, a fast-paced police sitcom; <em>Drugs in Hollywood<\/em>, a documentary series about drugs, featuring a scripted narrator and non-scripted dialog delivered by drunk and\/or drugged people; <em>Uncle Grandpa<\/em>, an animated show for children with very fast dialog and multiple characters with cartoonish voices.<\/p>\n<p style=\"text-align: left;\"><strong>In Latin American Spanish<\/strong> (note that LAS presents a multitude of different accents and variants, with differing vocabulary, grammatical constructions, etc.): <em>El Show de Jaime Maussan<\/em>, a Mexican journalistic show about UFOs and aliens, mostly scripted, but with some interviews; <em>El Puntero<\/em>, an Argentine drama about a political broker who mediates between the impoverished underclass and politicians; <em>El Se\u00f1or de los Cielos<\/em>, a Mexican show about drug lords and trafficking; <em>Soy Luna<\/em>, an Argentine teenage <em>telenovela<\/em> with many characters\/actors from all over Latin America; <em>Odisea Argentina<\/em>, an unscripted Argentine political program with three presenters who debate current events and politics.<\/p>\n<p style=\"text-align: left;\"><strong>In Brazilian Portuguese<\/strong>: <em>O Caminho das Nuvens<\/em>, a dramatic film with many adult and child characters.<\/p>\n<p style=\"text-align: left;\">I encountered some difficulties when testing the software, so I could not test all clips on all ASRs. The Dutch tool does not offer Portuguese as a language. 
One of the English companies does not have a user interface yet, so I could only test one clip with them so far (via email), but they ran it through four different speech recognition services (Amazon, IBM, Google, and their proprietary one) and provided an accuracy percentage for each. The second English software supports all languages and has a very good user interface, but offered a limited trial, so I could not process every clip.<\/p>\n<p style=\"text-align: left;\">Next is the question of whether the tools provide a text transcription only or a timed subtitling file. Both English companies provide only text transcription, which is useful only to some extent or for a small part of the job. The Dutch company generates a very well timed transcription and has a commendable grasp of line and subtitle breaks. The Dutch tool and one of the English tools showed a sound management of punctuation, although they definitely do not punctuate like real linguists\u2026 but who does, right?<\/p>\n<p style=\"text-align: left;\">To date, I have the <em>Brooklyn Nine-Nine<\/em> clip processed by each tool, which allows for a good comparison, plus at least one version of each of the other clips; for those, I cannot compare results to determine which tool is more accurate, but I can definitely tell whether the output is useful for a typical audiovisual translator\u2019s workflow.<\/p>\n<p style=\"text-align: left;\">So far, my conclusion is that these tools are years away from becoming useful for audiovisual translators and AVT companies. First of all, the files produced by some of the ASR tools need to be segmented and\/or timed, which could be quite time-consuming. The Dutch software is definitely the most useful tool, because its user interface is very good and easy to use, and it even has a built-in editing tool for timing and text. 
But the actual transcription requires heavy editing.<\/p>\n<p style=\"text-align: left;\">Here are three examples from the <em>Brooklyn Nine-Nine<\/em> show for comparison. For the English company (\u201cEnglish company 1\u201d) that provided four different transcriptions from different services, I will use the one generated by Amazon, which had an overall accuracy score of 91%.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: left;\"><strong>EXAMPLE 1 <\/strong>(Quotes added by me; line breaks, capital letters, and punctuation \u2014 or lack thereof \u2014 are original):<\/p>\n<p style=\"text-align: left;\"><strong>English company 1: <\/strong><\/p>\n<p style=\"text-align: left;\">\u201cit\u2019s my craft anyways grandson\u2019s coming in they reunite and I throw another case on the old solved it hey micros aren\u2019t you want to see me captain\u201d<\/p>\n<p style=\"text-align: left;\"><strong>Dutch company: <\/strong><\/p>\n<p style=\"text-align: left;\">17<br \/>\n00:00:53,120 \u2013&gt; 00:00:59,760<br \/>\nIt\u2019s my craft anyways grandsons<br \/>\ncoming in they reunite and I threw<br \/>\nanother case on the old salted pile.<\/p>\n<p style=\"text-align: left;\">18<br \/>\n00:01:03,640 \u2013&gt; 00:01:11,840<br \/>\nHey Microsoft. You wanted to see me<br \/>\ncaptain.<\/p>\n<p style=\"text-align: left;\">(Note that in subtitle 18 there are actually two different speakers; in real-life subtitling, we would need to adjust the line break and add a dialog line or split the subtitle in two.)<\/p>\n<p style=\"text-align: left;\"><strong>English company 2: <\/strong><\/p>\n<p style=\"text-align: left;\">\u201cIt\u2019s my craft. Anyways grandson\u2019s coming in. They reunite and I throw another case on the old solved pile. Microsoft. 
Went to see me Captain.\u201d (Note: this tool identifies different speakers relatively accurately, even marking them as a female or male voice, and, in this case, it did recognize a change of speaker in \u201cWent to see me Captain.\u201d)<\/p>\n<p style=\"text-align: left;\"><strong>The correct human-made subtitling: <\/strong><\/p>\n<p style=\"text-align: left;\">31<br \/>\n00:00:53,178 \u2013&gt; 00:00:54,847<br \/>\nIt\u2019s my craft.<\/p>\n<p style=\"text-align: left;\">32<br \/>\n00:00:54,847 \u2013&gt; 00:00:56,598<br \/>\nAnyways,<br \/>\ngrandson\u2019s coming in.<\/p>\n<p style=\"text-align: left;\">33<br \/>\n00:00:56,640 \u2013&gt; 00:00:58,100<br \/>\nThey reunite,<br \/>\nand I throw another case<\/p>\n<p style=\"text-align: left;\">34<br \/>\n00:00:58,100 \u2013&gt; 00:00:59,810<br \/>\non the old \u201csolved it\u201d pile.<\/p>\n<p style=\"text-align: left;\">35<br \/>\n00:01:03,272 \u2013&gt; 00:01:04,606<br \/>\nHey, my croissant.<\/p>\n<p style=\"text-align: left;\">36<br \/>\n00:01:10,153 \u2013&gt; 00:01:11,613<br \/>\nYou wanted to see me,<br \/>\nCaptain?<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: left;\"><strong>EXAMPLE 2 <\/strong>(Quotes added by me; line breaks, capital letters, and punctuation\u2014or lack thereof\u2014are original):<\/p>\n<p style=\"text-align: left;\"><strong>English company 1: <\/strong><\/p>\n<p style=\"text-align: left;\">\u201cabsolutely sir I won\u2019t just head it up I will head and shoulders it up I will dive into women rounded just be just be altogether good with it be more articulate when you speak to the children\u201d (Note: \u201cBe more articulate\u2026\u201d is spoken by another character.)<\/p>\n<p style=\"text-align: left;\"><strong>Dutch company:<\/strong><\/p>\n<p style=\"text-align: left;\">22<br \/>\n00:01:25,040 \u2013&gt; 00:01:35,760<br \/>\nAbsolutely, sir, I won\u2019t just headed<br \/>\nup I will head and shoulders it up I<br \/>\nwill die women around it just be<br \/>\naltogether good 
with it<\/p>\n<p style=\"text-align: left;\">23<br \/>\n00:01:35,960 \u2013&gt; 00:01:38,120<br \/>\n, the more articulate when you speak<br \/>\nto the children.<\/p>\n<p style=\"text-align: left;\"><strong>English company 2:<\/strong><\/p>\n<p style=\"text-align: left;\">\u201cAbsolutely sir. I won\u2019t just headed up I will head and shoulders it up. I will dive in swim around it. Just be all together good with it.<\/p>\n<p style=\"text-align: left;\">Be more articulate when you speak to the children.\u201d (The software identified the change in speaker and separated the text onto another line.)<\/p>\n<p style=\"text-align: left;\"><strong>The correct human-made subtitling:<\/strong><\/p>\n<p style=\"text-align: left;\">45<br \/>\n00:01:24,168 \u2013&gt; 00:01:26,670<br \/>\nAbsolutely, sir.<\/p>\n<p style=\"text-align: left;\">46<br \/>\n00:01:26,670 \u2013&gt; 00:01:28,046<br \/>\nI won\u2019t just head it up,<\/p>\n<p style=\"text-align: left;\">47<br \/>\n00:01:28,046 \u2013&gt; 00:01:29,798<br \/>\nI will<br \/>\nhead and shoulders it up.<\/p>\n<p style=\"text-align: left;\">48<br \/>\n00:01:29,798 \u2013&gt; 00:01:32,050<br \/>\nI will dive in,<br \/>\nswim around it,<\/p>\n<p style=\"text-align: left;\">49<br \/>\n00:01:32,050 \u2013&gt; 00:01:35,762<br \/>\nand just be altogether<br \/>\ngood with it.<\/p>\n<p style=\"text-align: left;\">50<br \/>\n00:01:35,762 \u2013&gt; 00:01:38,390<br \/>\nBe more articulate<br \/>\nwhen you speak to the children.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: left;\"><strong>EXAMPLE 3 <\/strong><\/p>\n<p style=\"text-align: left;\"><strong>English company 1:<\/strong><\/p>\n<p style=\"text-align: left;\">\u201cuh mystery missus torino I\u2019m glad you\u2019re here Man present to you Oh my darlings Thank god Found you uh look at those beautiful cheeks\u201d (Note that the sentences starting with \u201cOh, my darlings\u201d are spoken by a second speaker.)<\/p>\n<p style=\"text-align: left;\"><strong>Dutch company: 
<\/strong><\/p>\n<p style=\"text-align: left;\">47<br \/>\n00:02:50,640 \u2013&gt; 00:02:55,440<br \/>\nMr-Mrs to Reno glad you\u2019re here<br \/>\nman present to you properly.<\/p>\n<p style=\"text-align: left;\">48<br \/>\n00:02:55,640 \u2013&gt; 00:03:01,680<br \/>\nMy darlings then God I found you oh<br \/>\nlook at those who in full she.<\/p>\n<p style=\"text-align: left;\"><strong>English company 2:<\/strong><\/p>\n<p style=\"text-align: left;\">\u201cThis you\u2019re Mrs. Torino. I\u2019m glad you\u2019re here. May I present to you.<\/p>\n<p style=\"text-align: left;\">My darlings. Thank God I found you. Oh look at those beautiful cheeks.\u201d (The software identified the speaker change and separated text into another line.)<\/p>\n<p style=\"text-align: left;\"><strong>The correct human-made subtitling:<\/strong><\/p>\n<p style=\"text-align: left;\">89<br \/>\n00:02:50,128 \u2013&gt; 00:02:51,755<br \/>\nAh, Mr. and Mrs. Terrino.<\/p>\n<p style=\"text-align: left;\">90<br \/>\n00:02:51,755 \u2013&gt; 00:02:53,173<br \/>\nI\u2019m glad you\u2019re here.<\/p>\n<p style=\"text-align: left;\">91<br \/>\n00:02:53,173 \u2013&gt; 00:02:54,842<br \/>\nMay I present to you\u2026<\/p>\n<p style=\"text-align: left;\">92<br \/>\n00:02:54,842 \u2013&gt; 00:02:58,637<br \/>\nOh, my darlings.<br \/>\nThank God I found you.<\/p>\n<p style=\"text-align: left;\">93<br \/>\n00:02:58,637 \u2013&gt; 00:03:01,598<br \/>\nOh, look at<br \/>\nthose beautiful cheeks.<\/p>\n<p style=\"text-align: left;\">English company 2 consistently presents more accurate transcriptions and includes better punctuation and capital letters. 
But the Dutch company, although not quite as accurate in the transcription, creates timed subtitles (if you compare the timecodes with the final subtitles, you see that they are very well timed), which might compensate for some of the misunderstandings.<\/p>\n<p style=\"text-align: left;\">As for the other English clips I had transcribed, the results were not very good. The file for <em>Drugs in Hollywood<\/em> was not useful at all, but I actually expected that, since it is a non-scripted show full of slurred dialog and slang. Even a human transcriber would have trouble tackling that show. The one for <em>Uncle Grandpa<\/em>, although better, still presented errors probably caused by the cartoonish voices and the outlandish topics covered, including strange names and invented words, so the software had serious problems with it, too. Clearly, this technology is not yet refined enough to be used on all kinds of material; it probably performs acceptably only on scripted material and single-speaker shows like newscasts, documentaries, lectures, sermons, etc.<\/p>\n<p style=\"text-align: left;\">As for the tests I performed on the Spanish clips, the results were very poor. The different accents and variants were too much for the ASR software options to compute, and the transcriptions were utterly unintelligible. Editing would have been more time-consuming than transcribing from scratch. The Portuguese results were also quite bad, although better than the Spanish.<\/p>\n<p style=\"text-align: left;\">I knew quite well that these tools would provide minimal assistance and that extensive post-editing would be needed before the resulting ASR files could be used as translation templates or for delivery to a client. My next step will be to edit the English transcriptions into final deliverables and compare the time and effort each takes. 
I will share my findings in part 2 of this article, coming soon in the next issue of <em>Deep Focus<\/em>.<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<hr \/>\n<h5 class=\"entry-content\" style=\"text-align: left;\">Mara Campbell is an ATA-certified translator from Buenos Aires, Argentina, and has been subtitling and translating subtitles and scripts for dubbing for the past 20 years. Mara is currently COO of True Subtitles, the company she founded in 2005, which has clients on three continents. Her work has been seen on Netflix, Hulu, HBO, BBC, Amazon, and more. She teaches courses on subtitling, closed captioning, and Latin American Neutral Spanish, and has spoken at conferences in Argentina, Uruguay, and Germany.<\/h5>\n<hr class=\"wp-block-separator\" \/>\n<h6 class=\"has-small-font-size\" style=\"text-align: left;\">Published in <em>Deep Focus<\/em>, Issue 2, March 2019<\/h6>\n","protected":false},"excerpt":{"rendered":"<p>By Mara Campbell Let me start with a straight answer to the titular question so that you can continue reading with a light heart: No! Now that we can all breathe easily, I will elaborate. You might be familiar with the concept of Speech Recognition Software (SRS) if you use Siri, Alexa, or Google Now. 
[&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[58,12],"tags":[22,20,62,33,30,38],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/posts\/719"}],"collection":[{"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/comments?post=719"}],"version-history":[{"count":1,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/posts\/719\/revisions"}],"predecessor-version":[{"id":720,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/posts\/719\/revisions\/720"}],"wp:attachment":[{"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/media?parent=719"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/categories?post=719"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ata-divisions.org\/AVD\/wp-json\/wp\/v2\/tags?post=719"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}