AMISTAD Lab: Proving the Limits of Large Language Models
This project is part of an NSF-funded REU site with the HMC Computer Science Department. You can learn about other REU projects here: https://sites.google.com/g.hmc.edu/summer2025reu/home.
(US citizens or permanent residents) You will need to apply via the ETAP website (linked here), not on this URO site!
NOTE: If you are not a US citizen or permanent resident, please apply here (on URO). I may have one additional slot open, depending on whether funding comes through.
Project Description
Can unintelligent, mechanical processes produce unlimited amounts of meaningful information, far beyond what is input into such systems? GPT-4 and other Large Language Models (LLMs) have been proffered as such “perpetual information machines.” These systems are a giant leap forward for AI. They can output coherent English text, perform language translation, solve problems, and write code. They also make mistakes, invent facts, and still fail to understand some common-sense situations, such as the fact that a player who must reveal their move first in rock-paper-scissors will always lose. Even with these shortcomings, LLMs trained on limited data can produce what appears to be limitless text. This text is largely coherent and conforms to the external specification of English grammar. It is also complex. Yet it is produced by a fully mechanical process ostensibly unguided by human thought or additional human interference.
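The rock-paper-scissors point can be made concrete with a short sketch (the function and dictionary names here are illustrative, not part of the project): when moves are revealed sequentially rather than simultaneously, the second player can always pick the winning counter, so the first mover always loses.

```python
# In sequential rock-paper-scissors, the second player sees the first
# player's move and can always choose the move that beats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def second_player_move(first_move: str) -> str:
    """Return the counter that beats the revealed first move."""
    return BEATS[first_move]

# No matter what the first player reveals, the counter wins.
for move in BEATS:
    print(f"{move} is beaten by {second_player_move(move)}")
```

An LLM that "goes first" in such a game and expects a fair chance of winning has missed this simple strategic fact, which is the kind of common-sense gap the paragraph above describes.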
This project will focus on the specific question: can LLMs produce more functional information than is input into such systems via their training data, initialization, and human prompting? The work will likely be proof-focused with a supplemental empirical, simulation component.
Implications
The question of whether LLMs are true information-generation systems has significant implications for technology, society, and our understanding of artificial general intelligence. A negative answer to the research question could help us understand the phenomenon of “model collapse,” in which generative AI systems degrade when trained exclusively on their own outputs.

Such an answer would also inform the ethical discussion around data dignity and the impact of copyrighted material used to train LLMs. If LLMs are merely regurgitative interpolation systems, their outputs would not represent new content, but merely recombined source material. A better understanding of what LLMs actually produce, rather than repurpose, could therefore guide legal theory on whether LLM training constitutes fair use or copyright infringement. It could also open the way to fairer policies on data scraping and compensation for data used in AI training.

Lastly, if LLMs cannot create more information than they consume, this would be evidence against the argument that AI systems will eventually replace humans in all areas of intellectual work. There would be formal, significant differences between human and artificial capabilities that could not be overcome with more training. This would represent a hard limit for the question of general AI, one that would need to be addressed in all future discussions. It would affect our understanding of what it means to be intelligent, and whether consciousness can be fully mechanized in computational systems.
- We are usually a productive lab. You can be productive with us.
- Students become well-trained in academic research, having a high degree of project autonomy and responsibility.
- You can work on open-ended projects at the frontiers of current scientific knowledge. We're not afraid to push boundaries. You must be comfortable with uncertainty and have the grit to persist when the way forward isn't clear.
- We have daily meetings to keep you on track and hold our teams accountable. You can work together with great teammates!
- You can get articles about your research featured on the Mudd website:
https://www.hmc.edu/about/2025/01/17/harvey-mudd-student-kerria-pang-naylor-earns-prestigious-cra-outstanding-undergraduate-researcher-award/
https://magazine.hmc.edu/spring-2022/this-kind-of-fun/
https://www.hmc.edu/about/2024/01/11/cs-research-sets-boundaries-for-two-distribution-hypothesis-testing/
https://www.hmc.edu/about-hmc/2022/01/14/icaart-2022-highlights-hmc-cs-research/
https://www.hmc.edu/about-hmc/2021/04/29/harvey-mudd-cs-students-publish-and-present-work/
https://www.hmc.edu/about-hmc/2021/03/16/cs-research-published/
https://www.hmc.edu/about-hmc/2020/07/07/cs-clinic-team-publishes-transfer-learning-research/
https://www.hmc.edu/about-hmc/2020/03/05/cs-researchers-win-best-paper-at-icaart/
https://www.hmc.edu/about-hmc/2020/01/14/cs-researchers-present-findings-in-machine-learning-algorithms/
- We strongly believe in the importance of mentoring:
https://www.hmc.edu/about-hmc/2021/05/14/2021-harvey-mudd-college-leadership-awards/
https://www.hmc.edu/about-hmc/2020/05/12/montanez-receives-diversity-mentor-award/