Improving Topic Modeling Tools for Novice Text Miners

Probabilistic topic models are widely used outside of computer science to find patterns of meaning in large text collections. However, like many other natural language processing tools, their effectiveness often depends on choices about data preparation and model configuration. Making these decisions often requires iterating through many options, which is particularly hard for people with limited machine learning or natural language processing background: in interviews our team conducted last summer, we found that it could take researchers many months to reach the point of training a usable model. Our research question is this: how can we design a tool that supports iterative refinement of topic models for a user base with limited programming experience but deep textual questions?

This summer, we will build on work from last summer to develop jsLDA 2.0, a revision of a small web-based topic modeling interface that streamlines common “loops” and workflows described by these users. We will then perform user studies with both novices and experts to further develop this tool, alongside an accompanying tutorial suitable for digital humanities and computational social science classrooms. Students working on this project will practice skills of data processing for text, visualization, web development, and user study design.

For more information about our project from last summer, check out our short video here.

Name of research group, project, or lab
The WHISK lab (Workflows for Humanistic Understanding with Statistical Knowledge)
Why join this research group or lab?

This project is extra-exciting in that it's very "meta": not only do we talk about machine learning, user interfaces, and social science, we also talk about how we talk about those things to different audiences. Working on this project will also give you the opportunity to meet people doing research that combines computing and culture and to practice a type of computational design that centers human inquiry. In the process, you'll get the chance to use the tool you build to study texts you're excited about, too!

Representative publication
Logistics Information:
Project categories
Computer Science
Student ranks applicable
Student qualifications

Coursework Requirements: Data Structures (CS 70) or equivalent, Linear Algebra

Helpful Skills (not required): Web Development (CSS, JavaScript, ReactJS), Probability, Statistics, Machine Learning, Natural Language Processing, Software Development (Git, GitHub, documentation and code review)

Time commitment
Summer - Full Time
Paid Research
Number of openings
Techniques learned

Some tasks you're likely to do include:
- practicing web development (using JavaScript with ReactJS) and some principles of user interface design
- mathematically engaging with machine learning algorithms and their error scenarios
- implementing, testing, and visualizing information theoretic measures to find where models are "misbehaving"
- talking to researchers in the humanities and social sciences about their work and the methodologies they've used
- diving into the cross-disciplinary work of text mining for cultural analysis
- communicating through papers, posters, blog posts, and videos about scientific work and its impact

Contact Information:
Mentor name
Xanda Schofield
Mentor email
Mentor position
Assistant Professor
Name of project director or principal investigator
Xanda Schofield
Email address of project director or principal investigator