Text Data Adventures: Laptop-Scale Text Analysis

This independent study introduces text-mining tools for digital humanities and computational social science, applied to a new dataset of the student’s choosing. Students will practice implementing data cleaning and data visualization pipelines in Python using libraries like nltk, scikit-learn, pandas, and matplotlib. In addition, students will build a conceptual understanding of "laptop-scale" computational tools used in the digital humanities, such as naïve Bayes classifiers, topic models, (small) language models, and statistical tests. The project will result in a short report (5-8 printed pages) in blog post format that states the hypothesis explored in the textual data of interest, reviews the relevant literature, and presents visualized results from the analyses performed.

Disclaimers:

We'll be reading a handful of academic papers as part of this independent study.
We'll be finding data (either pre-collected or scraped) from online sources. These are often messy; expect to spend some time cleaning data to make it usable!
I do not anticipate taking research students from Summer 2025 through Summer 2026; this would be a one-semester commitment and would not commit you to a lab long-term.
We will not be using large language models in these projects. While some digital humanists and computational social scientists use ChatGPT and similar tools in their work, this independent study will focus on more established and interpretable techniques for text exploration.

Name of research group, project, or lab

WHISK Lab

Why join this research group or lab?

Build Python data science skills
Practice synthesizing grounded domain knowledge with statistical/computational tools
Explore new datasets in all their messy glory

Logistics Information:

Computer Science

Student ranks applicable

First-year

Sophomore

Junior

Student qualifications

Students should have completed an introductory Python course (e.g., CS 5). Students will also need to be available on Mondays from 1:15 to 2:30 for weekly meetings.

Spring - Part Time

Compensation

Academic Credit

Techniques learned

By the end of this project, students should be able to explain what makes a text pre-processing pipeline viable, use several popular open-source Python libraries to build such a pipeline, analyze the outputs of several simple machine learning models, and interpret those outputs in the context of the underlying texts. Students will also gain additional practice building effective data visualizations and writing about a computational process for a general audience.
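
As a rough illustration of what a small text pre-processing pipeline might look like, here is a minimal sketch using only the Python standard library. The function names (preprocess, term_counts), the tiny stopword list, and the sample sentences are all hypothetical and chosen for illustration; an actual project would likely rely on nltk or scikit-learn for tokenization and stopword handling.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# a real project would likely use a fuller list, e.g. from nltk.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Match runs of letters, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def term_counts(documents):
    """Count how often each remaining token appears across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(preprocess(document))
    return counts


# Hypothetical two-document corpus, just to exercise the pipeline.
docs = ["The whale is in the sea.", "A whale and a ship!"]
counts = term_counts(docs)
print(counts.most_common(3))
```

Each step here (lowercasing, tokenizing, removing stopwords) is a choice that can change downstream results, which is part of what makes a pipeline worth examining rather than taking as given.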

Project start

Spring 2024

Contact Information:

Mentor

Xanda Schofield

aschofield@hmc.edu

Assistant Professor

Name of project director or principal investigator

Xanda Schofield

Email address of project director or principal investigator

xanda@cs.hmc.edu
