by Mara Campbell
Since writing the first part of my article, I have been busy testing new automatic speech recognition (ASR) tools and comparing results. I have to admit that some of these programs are quite impressive and could be very helpful tools for subtitlers and captioners.
Again, my test clip was from the comedy show Brooklyn Nine-Nine, which presents interesting challenges for a subtitler: it is a fast-paced sitcom full of staccato dialogue, jokes and punchlines, rapid (and often interrupted) conversations, slang, and some background noise and music, all things with which automatic speech recognition software struggles. My intention was to put the tools through a tough test, because if they can perform acceptably under these conditions, they can be useful not only in similar scenarios but also on other types of material, such as documentaries, dramatic movies and shows, newscasts, etc.
I tested four new tools: a Swiss one, an American one, a Belgian one, and the automatic captioning tool that a big mainstream video platform provides.
This time, I will not bore you by comparing the outputs of the different tools. My approach was to test how long it took me to conform the subtitles created by these tools to the usual standards our clients require. Of the seven tools I tried, only five produced a timed and transcribed file, so I worked with those; the other two do not yet have a user interface with which to interact, but they are being developed.
Time savings
I started by subtitling the file from scratch, with no script, and timed how long the process took me. I used professional subtitling software that I have worked with for the past decade, so that familiarity with the tool would not be a factor. Then I opened the other files in the same software and conformed them. Conforming included correcting mishears; re-timing subtitles whenever necessary; changing line and subtitle breaks; adding punctuation, capital letters, and italics; adding hyphens to dual-speaker subtitles; and whatever else was necessary to bring the files up to the Netflix style guide standard. Some tools have an excellent grasp of punctuation, some lack it altogether; most of them handle line and subtitle breaks impressively, and all of them have flawless timing, so each file required focusing on different aspects.
All of the tools offer an area in their interface where the text and timing can be adjusted without having to export a file and import it into another desktop application. However, I thought that putting myself through the learning curve of five different programs would be too time-consuming and that the conditions would not be equal, with some platforms being more user-friendly, others having more editing functionality, etc.
These are my results, working on the 10-minute clip:
Tool | Time required | Time saved
Origination (no script) | 122 minutes | -
Dutch software | 117 minutes | 6.5 %
Swiss software | 103 minutes | 16 %
Automatic captions by video platform | 90 minutes | 26 %
Belgian software | 86 minutes | 29 %
American software | 76 minutes | 38 %
As you can see, even the tool that performed most poorly still saved some time, and the one that performed best saved an impressive 46 minutes. Why, even the automatic captions that we generally turn off when we watch online videos save over half an hour’s worth of work!
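For readers who want to check the arithmetic, the savings can be recomputed from the times in the table. The following is a minimal sketch in Python; the names `BASELINE_MINUTES`, `conform_minutes`, and `time_saved` are illustrative choices, not part of any tool mentioned here, and the percentage is assumed to be minutes saved divided by the 122-minute origination time.

```python
# Times in minutes, copied from the results table above.
BASELINE_MINUTES = 122  # origination from scratch, no script

conform_minutes = {
    "Dutch software": 117,
    "Swiss software": 103,
    "Automatic captions by video platform": 90,
    "Belgian software": 86,
    "American software": 76,
}


def time_saved(baseline, conformed):
    """Return (minutes saved, percent saved relative to the baseline)."""
    saved = baseline - conformed
    return saved, 100 * saved / baseline


# Assumption: "time saved" means baseline minus conforming time.
savings = {
    tool: time_saved(BASELINE_MINUTES, minutes)
    for tool, minutes in conform_minutes.items()
}

for tool, (saved, pct) in savings.items():
    print(f"{tool}: {saved} min saved ({pct:.1f} %)")
```

Under that assumption, four of the five rows agree with the table to within a percentage point. The Dutch row comes out closer to 4 % than to the 6.5 % listed, so one of those two figures may contain a typo.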
Cyber-security
Of course, there are many other considerations to take into account, because today’s world revolves less around productivity than around security. Most of these tools offer very secure environments and guarantee that videos and files are removed immediately when the user deletes them. However, audiovisual translators these days are bound by airtight NDAs and/or no longer have the access to videos that they used to: it is now very common to work on videos our clients provide within their cloud-based environment, with no possibility of downloading them and feeding them to our tool of choice. This narrows down the volume of work we can process through ASR software.
Pricing
This also has an impact on pricing. There seem to be two models: the pre-paid and the per-minute systems. The pre-paid system means you have to buy a bundle of credits or minutes, which are consumed as you process videos through the system. Some companies charge a monthly fee and others a daily fee, where you can process as many minutes as you want within that timeframe. The per-minute system charges you per minute of video you feed the ASR software. Of course, each user’s volume will determine which system suits them better, but some companies offer very competitive prices, perfectly affordable even for freelancers with small volumes of “ASR-processable” material.
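To see how volume decides between these models, it can help to put numbers side by side. The sketch below is only an illustration: every price, the bundle size, and the monthly volume are hypothetical placeholders rather than any vendor's actual rates, and the function names are invented for this example.

```python
import math


def per_minute_cost(video_minutes, rate_per_minute):
    """Per-minute model: pay for each minute of video processed."""
    return video_minutes * rate_per_minute


def prepaid_cost(video_minutes, bundle_minutes, bundle_price):
    """Pre-paid model: buy whole bundles until the volume is covered."""
    bundles_needed = math.ceil(video_minutes / bundle_minutes)
    return bundles_needed * bundle_price


def flat_fee_cost(periods, fee_per_period):
    """Flat-fee model: unlimited minutes within each paid period."""
    return periods * fee_per_period


# Hypothetical figures, for illustration only.
monthly_volume = 120  # minutes of "ASR-processable" video per month

options = {
    "per-minute": per_minute_cost(monthly_volume, rate_per_minute=0.25),
    "pre-paid bundle": prepaid_cost(monthly_volume, bundle_minutes=100, bundle_price=20.0),
    "monthly flat fee": flat_fee_cost(periods=1, fee_per_period=35.0),
}

cheapest = min(options, key=options.get)
for name, cost in options.items():
    print(f"{name}: {cost:.2f}")
print("Cheapest for this volume:", cheapest)
```

With these made-up numbers the per-minute model wins, because 120 minutes forces the purchase of a second 100-minute bundle. Changing `monthly_volume` to your own figure, and the prices to a real quote, shows where the balance tips.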
Other features
Many tools offer transcription in several languages, one of them supporting up to an impressive 28, and all of them state they are working fast on adding more languages. The few clips I have tested in Latin American Spanish and Brazilian Portuguese yielded very acceptable results, too. And a few of these companies also add machine translation features, with some surprisingly acceptable results and very interesting approaches towards the task.
Conclusions
It is probably a matter of using these technologies with the right material: the videos we have access to (and can process through ASR tools without breaching any NDA) and their theme or topic. Documentaries, interviews, sermons, some kinds of scripted material, etc., make perfect candidates for these tools. Unscripted programs, reality shows, game shows, talk shows, informal interviews, some sitcoms, mockumentaries, etc., might not be very good matches.
So maybe I am asking the wrong question in the title. It might not be a matter of ASR tools replacing audiovisual translators, but of aiding us in our work. I recently came across a clip that shows how theatrical subtitles were made 40 years ago. It was definitely a labor of love: re-typing each subtitle, creating copper printing blocks similar to the ones used with a letterpress printer, and then embossing them by heat and pressure on the actual film. I suspect that when more sophisticated yet user-friendly and cost-effective software started to appear, those subtitlers must have felt that they would be taken over by machines. But nowadays, none of us would make a living were it not for the wonderful software that aids us and actually makes our work possible.
Perhaps this is a time when we must start thinking about reinventing our role. This change is going to happen, so what can we, as experienced audiovisual translators, bring to the table? In all AI matters, there is one thing that computers can’t do and will never ever be able to do: provide context. The system can transcribe and time a joke perfectly, but the human is the only one who understands where the laughs are intended to happen, and hence where to split the subtitle to leave the punchline on its own, for a greater impact on the audience. Or where to add a subtitle break to create some suspense before revealing the name of the winner of a singing competition.
We will always be needed. Our years of experience are not lost. It is just a matter of finding new ground where our expertise can be better capitalized on. I personally find it a very exciting challenge. After all, who wants to do the exact same thing forever? Reinvention is the name of the game.