Text Data Adventures: Laptop-Scale Text Analysis

This independent study will provide an introduction to text mining tools for digital humanities and computational social science, applied to a new dataset of the student’s choosing. This includes practice implementing data cleaning and data visualization pipelines in Python using libraries like nltk, scikit-learn, pandas, and matplotlib. In addition, students will build conceptual understanding of "laptop-scale" computational tools used in the digital humanities, such as naïve Bayes classifiers, topic models, (small) language models, and statistical tests. The project will result in a short report (5-8 printed pages) in blog post format that presents the hypothesis explored for the chosen textual data, a literature review, and visualized results from the analyses performed.

Disclaimers:

  • We'll be reading a handful of academic papers as part of this independent study.
  • We'll be finding data (either pre-collected or scraped) from online sources. These are often messy; expect to spend some time cleaning data to make it usable!
  • I am not anticipating having research students from Summer 2025 through Summer 2026; this would be a one-semester commitment and would not commit you to a lab long-term.
  • We will not be using large language models in these projects. While some digital humanists and computational social scientists use ChatGPT, etc. for their work, this independent study will focus on more established and interpretable techniques for text exploration.

Name of research group, project, or lab
WHISK Lab
Why join this research group or lab?
  • Build Python data science skills
  • Practice synthesizing grounded domain knowledge with statistical/computational tools
  • Explore new datasets in all their messy glory
Logistics Information:
Project categories
Computer Science
Student ranks applicable
First-year
Sophomore
Junior
Student qualifications

Students should have completed an introductory Python course (e.g., CS 5). Students must also be available for weekly meetings on Mondays from 1:15 to 2:30.

Time commitment
Spring - Part Time
Compensation
Academic Credit
Number of openings
6
Techniques learned

By the end of this project, students should be able to explain what makes a text pre-processing pipeline viable, navigate several popular open-source Python libraries to build such a pipeline, analyze outputs of several simple machine learning models, and articulate these outputs in a textually driven context. Students will also gain additional practice building effective data visualizations and writing about a computational process for a general audience.
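
To give a rough sense of scale, a text pre-processing pipeline of the kind mentioned above might be sketched as follows. This is only an illustrative sketch using the Python standard library, not the pipeline used in the study: the function names `preprocess` and `term_counts`, the tiny stopword list, and the sample sentences are all assumptions made up for this example. An actual project would more likely draw on nltk or scikit-learn.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; a real project would
# likely use a fuller list, such as the one distributed with nltk.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def term_counts(documents):
    """Count how often each remaining token appears across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(preprocess(document))
    return counts


# Made-up sample documents, standing in for a cleaned dataset.
docs = ["The whale is in the sea.", "A whale and a ship."]
counts = term_counts(docs)
```

Counts like these could then feed a bar chart or a simple classifier. Each choice here (what counts as a token, which words to discard) is a modeling decision worth justifying in the final report.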

Project start
Spring 2024
Contact Information:
Mentor
aschofield@hmc.edu
Assistant Professor
Name of project director or principal investigator
Xanda Schofield
Email address of project director or principal investigator
xanda@cs.hmc.edu