Topic Modeling + Structured Priors for Text-Driven Science
New techniques to extract useful scientific data from text
Monday, March 2, 2015 · 12–1 PM
Michael Paul from Johns Hopkins University will talk about his research on developing better techniques for extracting useful information from text.
Many scientific disciplines, particularly in the health and social sciences, are being revolutionized by the explosion of public data on the web and social media. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires tools that can make sense of massive amounts of unstructured and unlabeled text. Topic models, statistical models that describe low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. However, to be useful in scientific analyses, topic models must learn interpretable patterns that accurately correspond to real-world concepts of interest.
In this talk, Paul will introduce Sprite, a family of topic models that can encode additional structures such as hierarchies, factorizations, and correlations, and can incorporate supervision and domain knowledge. Sprite extends standard topic models by formulating the Bayesian priors over parameters as functions of underlying components, which can be constrained in various ways to induce different structures. The result is a unifying representation that generalizes several existing topic models and provides a powerful framework for building new ones. He will describe a few specific instantiations of Sprite and show how these models can be used in various scientific applications, including extracting self-reported information about drugs from web forums, analyzing healthcare quality in online reviews, and summarizing public opinion in social media on issues such as gun control.
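For readers who want a concrete picture of "priors as functions of underlying components", here is a minimal Python sketch. It is an illustration under stated assumptions, not the speaker's implementation: the abstract does not give the functional form, so the log-linear form used here (the Dirichlet hyperparameters are the exponential of a weighted sum of component vectors) is an assumption, and the names `structured_dirichlet_prior`, `omega`, and `beta` are hypothetical.

```python
import numpy as np


def structured_dirichlet_prior(weights, components):
    """Build positive Dirichlet hyperparameters from shared components.

    Assumed form (not stated in the abstract): prior = exp(weights @ components).
    weights:    (n_topics, n_components) topic-to-component weights
    components: (n_components, vocab_size) real-valued component vectors
    """
    return np.exp(weights @ components)


rng = np.random.default_rng(0)
n_components, vocab_size = 2, 6

# Hypothetical component vectors over a toy six-word vocabulary.
omega = rng.normal(size=(n_components, vocab_size))

# Constraining the weights induces structure. Here every topic attaches to
# exactly one component, which gives a simple two-level hierarchy:
# topics 0 and 1 share component 0, topics 2 and 3 share component 1.
beta = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])

# One row of Dirichlet hyperparameters per topic, shape (4, 6).
phi_prior = structured_dirichlet_prior(beta, omega)

# Draw each topic's word distribution from its own structured prior.
topics = np.vstack([rng.dirichlet(row) for row in phi_prior])
```

In this sketch, topics that share a component also share a prior, so their word distributions are pulled toward the same region. Relaxing the one-component-per-topic constraint on `beta` would let a topic mix several components instead, which is one way to think about the factorized and correlated structures the abstract mentions.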
Michael Paul is a PhD candidate in Computer Science at Johns Hopkins University. He earned an M.S.E. in CS from Johns Hopkins University in 2012 and a B.S. in CS from the University of Illinois at Urbana-Champaign in 2009. He has received PhD fellowships from Microsoft Research, the National Science Foundation, and the Johns Hopkins University Whiting School of Engineering. His research focuses on exploratory machine learning and natural language processing for the web and social media, with applications to computational epidemiology and public health informatics.