Text Data Adventures: Laptop-Scale Text Analysis
This independent study introduces text mining tools for digital humanities and computational social science, applied to a new dataset of the student’s choosing. Students will practice implementing data cleaning and data visualization pipelines in Python using libraries like nltk, scikit-learn, pandas, and matplotlib. In addition, students will build conceptual understanding of "laptop-scale" computational tools used in the digital humanities, such as naïve Bayes classifiers, topic models, (small) language models, and statistical tests. The project will result in a short report (5-8 printed pages) in blog post format that presents the hypothesis explored in the chosen textual data, a literature review, and visualized results from the analyses performed.
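To give a flavor of what "laptop-scale" means here, the sketch below trains a tiny naïve Bayes text classifier using only the Python standard library. It is an illustration, not the method the study prescribes: the four toy documents, the labels, and the function names (`tokenize`, `train_naive_bayes`, `predict`) are all made up for this example, and in practice a library implementation such as scikit-learn's would likely be used instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(docs):
    """Count documents per label and words per label.

    docs is a list of (text, label) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for this example only.
toy_docs = [
    ("the whale swam through the dark sea", "novel"),
    ("the captain watched the sea at dawn", "novel"),
    ("the senate passed the budget bill", "news"),
    ("voters backed the new tax bill", "news"),
]
model = train_naive_bayes(toy_docs)
prediction = predict(model, "the captain saw a whale")
print(prediction)
```

The whole model is a handful of word counts, which is part of the appeal: it trains instantly on a laptop, and every prediction can be traced back to the specific words that drove it.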
Disclaimers:
- We'll be reading a handful of academic papers as part of this independent study.
- We'll be finding data (either pre-collected or scraped) from online sources. These are often messy; expect to spend some time cleaning data to make it usable!
- I am not anticipating having research students from Summer 2025 through Summer 2026; this would be a one-semester commitment and would not commit you to a lab long-term.
- We will not be using large language models in these projects. While some digital humanists and computational social scientists use ChatGPT, etc. for their work, this independent study will focus on more established and interpretable techniques for text exploration.
Goals:
- Build Python data science skills
- Practice synthesizing grounded domain knowledge with statistical/computational tools
- Explore new datasets in all their messy glory