Reno Kriz's talk on "Takeaways from the SCALE 2024 Workshop on Event-Centric Video Retrieval"
Tuesday, October 8, 2024 · 1:30 - 2:30 PM
Abstract:
Information dissemination for current events has traditionally consisted
of professionally collected and produced materials, leading to large
collections of well-written news articles and high-quality videos. As a
result, most prior work in event analysis and retrieval has focused on
leveraging this traditional news content, particularly in English.
However, much of the event-centric content today is generated by
non-professionals, such as on-the-scene witnesses to events who hastily
capture videos and upload them to the internet without further editing;
these are challenging to find due to quality variance, as well as a lack
of text or speech overlays providing clear descriptions of what is
occurring. To address this gap, SCALE 2024, a 10-week research workshop
hosted at the Human Language Technology Center of Excellence (HLTCOE),
focused on multilingual event-centric video retrieval, or the task of
finding videos about specific current events. Around 50 researchers and
students participated in this workshop and were split up into five
sub-teams. The Infrastructure team focused on developing MultiVENT 2.0, a
challenging new video retrieval dataset consisting of 20x more videos
than prior work and targeted queries about specific world events across
six languages. The other teams each worked on improving models for a
single modality: Vision, Optical Character Recognition (OCR),
Audio, and Text. Overall, we came away with three primary findings:
extracting specific text from a video allows us to take better advantage
of powerful methods from the text information retrieval community; LLM
summarization of initial text outputs from videos is helpful, especially
for noisy text coming from OCR; and no one modality is sufficient, with
fusing outputs from all modalities resulting in significantly higher
performance.
--
Reno Kriz is a research scientist at the Johns Hopkins University
Human Language Technology Center of Excellence (HLTCOE). His primary
research interests involve leveraging large pre-trained models for a
variety of natural language understanding tasks, including those
crossing into other modalities, e.g., vision and speech understanding.
These multimodal interests have recently involved the 2024 Summer Camp
for Language Exploration (SCALE) on event-centric video retrieval and
understanding. He received his PhD from the University of Pennsylvania,
where he worked with Chris Callison-Burch and Marianna Apidianaki on
text simplification and natural language generation. Prior to that, he
received BA degrees in Computer Science, Mathematics, and Economics from
Vassar College.