Topic models are a popular machine learning tool for exploring unstructured text across numerous domains, ranging from data journalism and customer service to economics, history, and literature. However, people who use these models to uncover patterns in a text collection often have specific themes in mind, and wish to see whether those themes arise naturally rather than through supervision. In this study, we will explore novel ways for investigators to encode modest hypotheses about what themes should arise into evaluation metrics, using a small collection of ``highlights,'' or instances of phenomena of interest. Drawing on concepts from mutual information, semantic similarity, and probabilistic modeling, participants will devise new metrics that quantify how well a model separates themes from one another, as well as rank candidate passages as potential additional highlights, testing whether they match human expectations about a theme. These metrics will provide guidance for a variety of activities known to consume significant investigator time with topic models, including selecting the number of topics, finding frequency thresholds for key terms, and determining how to split longer texts into thematically unified passages.
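As a concrete, deliberately simplified illustration of the kinds of metrics we have in mind, the sketch below scores a trained topic model against a handful of labeled highlights in two ways: the mutual information between the investigator's theme labels and each highlight's dominant topic, and a TF-IDF cosine-similarity ranking of candidate passages against one theme's highlights. The highlight texts, the \texttt{dominant\_topics} array, and the candidate list are all hypothetical placeholders standing in for a real corpus and model; the metrics participants devise will be richer than these baselines.

\begin{verbatim}
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mutual_info_score
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical highlights: short passages, each tagged with the theme
# the investigator expects them to exemplify.
highlights = [
    ("the harvest failed and grain prices rose sharply", "economy"),
    ("tariffs on imported cloth strained local merchants", "economy"),
    ("the regiment marched north before the siege began", "war"),
    ("cannon fire was heard across the river at dawn", "war"),
]
texts, themes = zip(*highlights)

# (1) Separation metric: mutual information between the theme labels
# and the topic model's dominant topic per highlight. Higher values
# mean the model's topics track the hypothesized themes more closely.
dominant_topics = [3, 3, 7, 7]  # placeholder output of a trained model
separation = mutual_info_score(themes, dominant_topics)
print(f"theme/topic mutual information: {separation:.3f}")

# (2) Ranking metric: score unseen candidate passages by semantic
# similarity (here, TF-IDF cosine) to the centroid of one theme's
# highlights, surfacing likely new highlights for human review.
candidates = [
    "merchants petitioned the crown over the new grain tax",
    "the garrison surrendered after three weeks of bombardment",
]
vec = TfidfVectorizer().fit(list(texts) + candidates)
H = vec.transform(texts).toarray()
C = vec.transform(candidates).toarray()
centroid = H[np.array(themes) == "economy"].mean(axis=0, keepdims=True)
scores = cosine_similarity(C, centroid).ravel()
for cand, s in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{s:.3f}  {cand}")
\end{verbatim}

Even this toy version supports the workflow sketched above: the mutual information score can be compared across models trained with different numbers of topics, and the similarity ranking gives the investigator a prioritized list of passages to accept or reject as new highlights.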
The WHISK lab combines practices from machine learning, human-centered design, and information theory to develop solutions that work for digital humanists and computational social scientists. Students will have the opportunity to combine skills from many different areas to answer questions of real practical interest to practitioners. This summer's project will require creativity and ingenuity to create modular, easy-to-use standalone tools with a solid quantitative foundation. Along the way, students should also have the chance to explore text data collections that interest them.