Large language models (LLMs), such as ChatGPT, need vast amounts of text in a language to reproduce it well, and for that reason the best models are often trained on massive amounts of internet content. One unfortunate result is that LLMs lack fluency and capabilities in many languages simply because there isn’t enough data on the internet in those languages. Antonios Anastasopoulos, an expert in natural language processing, received a $600,000 CAREER award from the National Science Foundation (NSF) to address this exact problem.
It is no surprise that Anastasopoulos is a leading researcher in the field, having grown up in a home that appreciated language. “My mother taught English in Greece, where I am from, where she had a language school. I’ve always loved languages. I took German and Swedish before I graduated from high school and in college I tried out Italian, Russian, and Chinese.”  
 
This gave him tremendous insight into the notion of language uptake. “Think of the way that kids learn: any child can acquire a language with orders of magnitude less data than what we’ve given ChatGPT,” said Anastasopoulos. “But another learning mode humans often use is this: I already know one language, say English, and I take lessons where someone explains to me in English how to learn another language, like Spanish. I do some exercises and I make associations and I learn the new vocabulary and grammar and eventually I can become proficient in Spanish.”
But the challenge, he says, is that we do not teach our LLMs this way. “We don’t have data online for 2000 languages, only about 200. But we do have grammatical descriptions of them because linguists have written books about these languages. If we can teach an LLM what’s in those books, along with additional data, we should be able to make the models proficient in these languages.”
In addition to building appropriate datasets, the project will make theoretical connections to various learning paradigms. This approach will attempt to model the process of multilingual learning itself and seek to understand the errors exhibited by LLMs.
The research will be integrated into education through new teaching modules that combine linguistics with natural language processing and through the promotion of undergraduate research. The project will extend its impact beyond the classroom through close collaboration with underserved language communities, aiming to build technologies according to the communities' needs.
Anastasopoulos, an assistant professor in computer science, focused his PhD on endangered languages and notes the importance of preserving languages. “Each language stores or codifies part of human knowledge, and language loss eventually means loss of that unique knowledge,” he said. “There is value in preserving and documenting that knowledge. That’s why this whole thing matters.”
The NSF CAREER award is reserved for the nation’s most talented up-and-coming researchers. From the NSF website: “The Faculty Early Career Development (CAREER) Program offers NSF’s most prestigious award in support of early-career faculty who have the potential to serve as academic role models in research and education and to lead advances in the mission of their department or organization.”
In addition to this honor, in 2024 Anastasopoulos won the GMU Presidential Award for Faculty Excellence in Research for his contributions to computational linguistics and natural language processing. Also in 2024, his paper “DialectBench: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages” received one of the Best Social Impact Paper Awards at ACL, the premier NLP conference, for showing that LLMs cannot handle dialects as well as standard language varieties.
