User-Friendly Tuning of Unsupervised Models of Text

Topic models are a popular machine learning tool for exploring unstructured text across numerous domains, ranging from data journalism and customer service to economics, history, and literature. However, people trying to use these models to uncover patterns in a text collection often have specific themes in mind, but wish to see whether these themes arise naturally instead of via supervision. In this study, we will explore novel ways for investigators to encode modest hypotheses about what themes should arise into evaluation metrics, using a small collection of "highlights," or instances of phenomena of interest. Drawing on concepts from mutual information, semantic similarity, and probabilistic modeling, participants will devise new metrics that quantify how well the model separates themes from each other, as well as rank possible candidates for other highlights to see if they match human expectations about a theme. These metrics will provide guidance for a variety of activities known to consume significant investigator time with topic models, including selecting the number of topics, finding frequency thresholds for key terms, and determining how to split passages of longer texts into thematically unified units.
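As a rough illustration of the mutual-information idea, one simple metric asks how much knowing a passage's dominant topic tells us about whether the passage was marked as a highlight. The sketch below is a hypothetical example, not the project's actual metric; the function name and the toy data are invented for illustration.

```python
import math
from collections import Counter

def highlight_topic_mi(topics, is_highlight):
    """Mutual information (in nats) between each passage's dominant topic
    and a boolean flag for whether the passage is a highlight.

    Higher values mean the model's topics separate highlighted passages
    from the rest of the collection more cleanly."""
    n = len(topics)
    joint = Counter(zip(topics, is_highlight))   # joint counts of (topic, flag)
    topic_counts = Counter(topics)               # marginal counts per topic
    flag_counts = Counter(is_highlight)          # marginal counts per flag
    mi = 0.0
    for (t, h), count in joint.items():
        p_joint = count / n
        # p_joint * log( p_joint / (p_topic * p_flag) )
        mi += p_joint * math.log(p_joint * n * n / (topic_counts[t] * flag_counts[h]))
    return mi

# Toy example: topic 0 passages are exactly the highlights,
# so the MI equals the entropy of the flag, log(2) nats.
topics = [0, 0, 0, 1, 1, 1]
flags = [True, True, True, False, False, False]
print(highlight_topic_mi(topics, flags))  # → 0.6931471805599453
```

A metric like this could be computed across models trained with different numbers of topics, providing one quantitative signal for the model-selection activities described above.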


Name of research group, project, or lab
Why join this research group or lab?

The WHISK lab combines practices from machine learning, human-centered design, and information theory to develop solutions that work for digital humanists and computational social scientists. Students will have the opportunity to combine skills from many different areas to answer questions of real practical interest to a number of practitioners. This summer's project will require creativity and ingenuity to create modular, easy-to-use standalone tools with a solid quantitative foundation. Along the way, students will have the opportunity to explore text data collections that interest them.

Logistics Information:
Project categories
Computer Science
Human-centered Design
Machine Learning
Natural Language Processing
Student ranks applicable
Student qualifications

- Some course introducing probability and statistics (familiarity with conditional probabilities, Bayes' rule, discrete distributions like the binomial and Poisson distributions)

- Independence in a programming environment (at least a data structures course, e.g. CS 70 HM or CS 62 PO, or past experience through independent projects or internships)

- Course work in a humanities, social science, or arts discipline with a significant textual component (e.g. close reading in a literature course, doing primary source analysis for a history course, working with transcripts in a psychology course, analyzing financial reports in an economics course)

Time commitment
Summer - Full Time
Paid Research
Number of openings
Techniques learned

Students will generate their own candidate metrics, accompanying them with their own case-study sets of highlights that they test on news datasets and Wikipedia articles across multiple languages. Experiments will examine the number of highlights necessary for the metrics to be effective, as well as strategies for automating or supporting users in the extraction of highlights. Students will release the successful metric(s) as small open-source Python modules to demonstrate their effectiveness. While the initial summer project will focus on topic models, later summers will seek to extend these metrics first to static word embeddings and then to fine-tuned BERT embeddings, to see whether there are ways to detect how successfully those models distinguish text from highlights.
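The ranking task described in the abstract, finding candidate passages that resemble a small set of highlights, can be prototyped with a simple semantic-similarity baseline. The sketch below ranks candidates by cosine similarity between bag-of-words vectors and the centroid of the highlight set; it is an assumed baseline for illustration, not a metric the project has committed to, and all function names and example texts are invented.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(highlights, candidates):
    """Rank candidate passages by similarity to the highlight centroid."""
    centroid = Counter()
    for h in highlights:
        centroid.update(bow(h))
    return sorted(candidates, key=lambda c: cosine(bow(c), centroid), reverse=True)

highlights = ["wheat prices rose sharply", "grain exports fell"]
candidates = ["corn prices rose", "the concert drew a large crowd"]
print(rank_candidates(highlights, candidates)[0])  # → corn prices rose
```

Comparing a model-based ranking against a baseline like this is one way to check whether the ranked candidates match human expectations about a theme.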

Contact Information:
Xanda Schofield
Assistant Professor
Name of project director or principal investigator
Xanda Schofield
Email address of project director or principal investigator