AI in the world of interpretation
By Dr. Holly Silvestri
Not all algorithmic decision-making systems require the same level of scrutiny. However, where these systems are making decisions about people, resources that concern people and their access to those resources, or issues that affect the ability of people to fully participate in society, they do merit a good deal of scrutiny because of their potential to do harm. Interpretation falls into these last two categories, in that the decision to use artificial intelligence (AI) for interpretation has serious potential to do harm.
What harm, you ask? That of limiting the options available to people who require language access services in order to, in turn, access other services they need.
For example, if AI interpretation (which, in many cases and for all intents and purposes, is “subtitling” an interaction) is used with no human available to monitor the computerized output and correct it, or to step into an interaction where there is confusion, then the interpretation provided very likely does more harm than good. There are those who say that something is better than nothing. AI interpretation is better than no language access at all, they argue. But if that something has the same result as nothing, then there was no language access provided.
“…limiting language access to only AI interpretation has the same effect as providing no access at all”
Therefore, limiting language access to only AI interpretation has the same effect as providing no access at all. If the AI interpretation is of lesser quality and results in the end-user’s inability to properly access services, then that person has been denied their legal right to meaningful language access for the purposes of fully participating in American society. And currently, even the best AI interpretation systems are not equal to human interpretation 100% of the time.
Because we know that the current technology used in AI interpretation does not really “interpret” but rather recycles output from Large Language Model (LLM)-enhanced neural machine translation (MT) combined with speech recognition and speech synthesis technologies, it behooves us to recognize that each of these technologies has significant weaknesses. This is true despite the major improvements that come from using LLMs to enhance neural MT results.
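For readers curious about what this “recycling” looks like in practice, the cascade can be sketched in a few lines of Python. This is a minimal, purely illustrative sketch of the chain described above, not any vendor’s implementation; the function names (transcribe, translate, synthesize) are hypothetical placeholders for the three underlying technologies.

```python
# A minimal, illustrative sketch of the cascaded pipeline described above.
# None of these functions correspond to any real product's API; they are
# hypothetical placeholders for the three technologies that current
# "AI interpretation" chains together.

def transcribe(audio: bytes, source_lang: str) -> str:
    """Speech recognition: source-language audio -> source-language text."""
    raise NotImplementedError  # placeholder for a speech-to-text model

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """LLM-enhanced neural machine translation: source text -> target text."""
    raise NotImplementedError  # placeholder for an MT/LLM model

def synthesize(text: str, target_lang: str) -> bytes:
    """Speech synthesis: target-language text -> target-language audio."""
    raise NotImplementedError  # placeholder for a text-to-speech model

def ai_interpret(audio: bytes, source_lang: str, target_lang: str) -> bytes:
    """Each stage consumes the previous stage's output, so an error at any
    stage is passed silently down the chain with no human to catch it."""
    source_text = transcribe(audio, source_lang)
    target_text = translate(source_text, source_lang, target_lang)
    return synthesize(target_text, target_lang)
```

The point of the sketch is simply that each stage depends entirely on the output of the stage before it, so the weaknesses of speech recognition, MT, and speech synthesis do not cancel each other out; they compound.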
Data quality, quantity, and integrity
On top of this, we have the additional concern that any AI interpretation system based on LLMs is only as good as the quality and quantity of data the machine is trained on and the underlying assumptions made by those who constructed the system. And speaking of data, it is important to remember that “During the early development of LLMs, researchers could use copyright exemptions to train their creations on textbooks and novels. However, the shift to commercial deployment means developers lose those exemptions and now face lawsuits from angry publishers and creators in attempts to deny free access to their works”.[1] None of these lawsuits have been resolved in the United States.
With respect to LLMs, the data used to train them creates two other problems.
First, the source of much of the data that will feed and, in some cases, has fed these LLMs is scraped from publicly available sources on the internet. This causes two major weaknesses in the dataset:
- A good portion of the data used to train these LLMs was made up of English and maybe 7 or 8 other languages for which there was enough quality data to train the system. A glance at the current data concerning which languages dominate the content of the internet should make this abundantly clear (see the image below):

- An international study also proved revealing. It showed that for non-English-language material, there is only a small amount of usable data: “A survey by an international team from 20 universities that was led by Google researchers found out of 205 Web-crawled datasets in non-English languages, 15 were unusable for pretraining. In almost 90 others, fewer than half the sentences were suitable. Even after filtering out problematic data, the information content of the data created other challenges.”[1]
There is, then, a disconnect between perception and reality. These systems are often presented as the panacea that solves the scarcity of human interpreters for Languages of Limited Diffusion (LLDs), which the NCIHC defines on its website as “any language in a geographic area in the U.S.—like a city, county or region—where the population of speakers is relatively small” (https://www.ncihc.org/languages-of-limited-diffusion). The reality is that, for any given LLD, there is no dataset freely available from the internet that can be used to train an LLM.
Second, it has come to light that for the non-English languages that populate the larger internet, some of the data used to train the LLMs may have come from texts that were poorly translated from English using MT.
This brings up the issue of potentially lower-quality results in those languages that were trained on poor MT translations from English, rather than native texts.
In short, if the data used to train an LLM in a non-English language consists primarily of poorly translated non-native texts, one can hardly expect the results of that LLM’s output for those non-English languages to be stellar. Garbage in, garbage out, as the saying goes.
Dialects and datasets
And then there are languages that may have a variety of dialects because of the diaspora of their speakers. Given the limited number of non-English languages that maintain a large enough presence on the internet to furnish a viable dataset with which to train an LLM, it is very likely that the LLM would prioritize the dialect of the European colonizing country. Why? Certain countries have a stronger presence on the internet for a variety of reasons, not the least of which are the economic advantages provided by the colonization of other countries (and the subsequent diffusion of their languages). The colonizing country, then, obtained internet access earlier than its former colonies because of its stronger economic position relative to those countries.
Compounding this problem is the fact that many, if not all, of the monolingual technical creators of AI do not recognize that dialects can differ. Or, if they do, they do not see this as problematic. Spanish is Spanish, right? Well… sometimes, but not always.
So, the data used to train an LLM in Spanish, for example, might be overwhelmingly Castilian Spanish as opposed to Latin American Spanish. While generally not incomprehensible to the end user, this could mean that the interpretation is less comprehensible to immigrants born or raised in Latin America, who make up the overwhelming majority of end users needing language access in the USA.
In addition, a human interpreter can quickly adjust the interpretation (to a certain extent) to match the end user’s profile, including dialect. AI interpreting does not yet have the capacity to do this on the fly. Add to this the fact that MT still has a great deal of difficulty managing idiomatic expressions and culturally bound references, and that only human-centered interpretation allows for clarification when there are language-based or other communication issues to resolve in the interpreted encounter, and AI interpretation seems significantly less appealing as a way to extend language access than it originally appeared—even for those limited non-English languages on which LLMs have been trained to a functional level.
Assumptions and biases
Another problem presented by AI interpretation relates to the underlying assumptions made by those who constructed the system. All humans come with biases. In fact, as human interpreters, our training teaches us to be mindful of and to try to reduce our implicit biases when we work.
However, an algorithmic decision-making system is also biased to a certain extent, regardless of the good intentions of its creator. The problem arises when the creator of the system uses an algorithm that is unintentionally reflective of their biases, thereby embedding those biases into the system. The image below makes clear how much human intervention is often truly involved in these systems.

How does this play out in AI interpretation? We have already seen one example: the prioritization of one dialect of a language (Castilian Spanish) over all others, regardless of the end user’s needs. This reflects both unconscious bias on the part of the technical creators and systemic bias that privileges the Castilian dialect over those of Latin America.
Additionally, there is a problematic common perception that because they are based on mathematical formulas, AI-based systems are LESS biased than human beings. The argument used is that math is objective and unbiased, so (“obviously”) these systems, which are based in math, are clearly superior to humans, who can be biased and not even aware of their biases.
In short, the public often assumes that through math, these systems eliminate bias. Math cannot be biased, as it is a system with only one correct answer. One plus one always equals two and never three, right?
However, this assumption is categorically incorrect, as the above image of an algorithmic decision-making system demonstrates. Despite the mathematical foundation, humans make these systems. The systems are necessarily reflective of not only human biases but also systemic ones. The creators of this technology can unwittingly embed these biases into a system, which is an especially harmful outcome for those humans on the losing end of those biases.
Proceed with caution
So, when can AI interpretation be used? The seemingly simple answer is in low-risk situations. Unfortunately, there is no currently accepted definition of what “low risk” situations are. And any attempt to define them quickly runs into exceptional cases which defy that definition.
For example, one could assume that use in tourism might be low risk much of the time. Using AI interpretation to, say, order food in a restaurant seems to present little potential for significant harm. But then you run into a situation where someone with a life-threatening allergy orders through an AI-operated system. Suddenly, there is the potential for real harm, because AI systems sometimes hallucinate (i.e., if they don’t have the information, they make it up; one might even say they can lie). So, if you were allergic to shrimp, would you trust a machine with a propensity to lie to tell you honestly whether there is shrimp in the dish you just ordered?
I know that with the current system used for AI interpretation, I sure as heck wouldn’t. At least not if I didn’t have my EpiPen with me….
[1] Chris Edwards, Communications of the ACM, “Data Quality May Be All You Need”, March 28, 2024. https://cacm.acm.org/news/data-quality-may-be-all-you-need/#references
Holly Silvestri is the senior coordinator for translation, training, and curriculum for the National Center for Interpretation at the University of Arizona. She has also taught for the Translation and Interpretation Program within the Spanish and Portuguese Department, which offers a bachelor’s degree in Spanish with a concentration in translation and interpreting. She has experience in the fields of translation and interpreting as well as in training interpreters, and is a member of the National Language Service Corps. Her working languages are Spanish, French, and English. She also runs her own language services provider business.