Bryan Li's talk "Towards Multilingual Evaluations of Knowledge for Large Language Models"
Tuesday, October 14, 2025 · 2 - 3 PM
Abstract:
Contemporary language models (LMs) support dozens of languages,
promising to broaden information access for global users. However,
existing multilingual evaluations largely study factual recall tasks,
failing to address knowledge-intensive tasks shaped by the uneven
coverage and different perspectives of knowledge across languages. This
dissertation investigates how LMs handle such tasks by examining their
internal parametric knowledge and their use of externally-provided
contextual knowledge. In the first part, I introduce benchmarks for
complex reasoning and territorial disputes, and find that LM responses
on both tasks lack cross-lingual robustness, producing inconsistent
answers to the same underlying queries when they are posed in different
languages. I then show that lightweight methods leveraging program
code and persona-based prompting can mitigate these issues.
In the second part, I explore the retrieval-augmented generation (RAG) setting, which combines an LM's internal parametric knowledge with contextual knowledge from external knowledge bases (KBs). Focusing on the territorial disputes task, I show that while RAG over single-language or single-source KBs has mixed effects on robustness, retrieving over multilingual and multi-source KBs (Wikipedia, as well as a large-scale dataset of state media articles I collected) substantially boosts robustness. Together, these findings highlight the need for LMs that can navigate, and assist users in navigating, the real-world distribution of knowledge across languages and sources.

This is a practice dissertation talk, and your feedback would be greatly appreciated!
--
Bryan Li is a final-year PhD student at the University of
Pennsylvania, advised by Prof. Chris Callison-Burch. His research
focuses on multilingual evaluations of LLMs, spanning both the fields of
natural language processing and computational social science. His work
has appeared in conferences such as ACL, COLM, and ICLR. Outside of
research, you can find him in a trendy cafe, on a riverside running
trail, or at home listening to a good podcast.