AMISTAD Lab: Proving the Limits of Large Language Models
This project is part of an NSF-funded REU site with the HMC Computer Science Department. You can learn about other REU projects here: https://sites.google.com/g.hmc.edu/summer2025reu/home.
(US citizens or permanent residents) You will need to apply via the ETAP website (linked here), not on this URO site!
NOTE: If you are not a US citizen or permanent resident, please apply here (on URO). I may have one additional slot open, depending on whether funding comes through.
Project Description
Can unintelligent, mechanical processes produce unlimited amounts of meaningful information, far beyond what is input into such systems? GPT-4 and other Large Language Models (LLMs) have been proffered as such “perpetual information machines.” These systems are a giant leap forward for AI. They can output coherent English text, perform language translation, solve problems, and write code. They also make mistakes, invent facts, and still fail to understand some common-sense situations, such as the fact that a player who must reveal their move first in rock-paper-scissors will always lose. Even with these shortcomings, LLMs trained on limited data can produce what appears to be limitless text. This text is largely coherent and conforms to the external specification of English grammar. It is also complex. Yet it is produced by a fully mechanical process ostensibly unguided by human thought or additional human interference.
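The rock-paper-scissors point can be made concrete with a short sketch (the function and dictionary names here are illustrative, not part of the project): when moves are revealed sequentially rather than simultaneously, the second player can always pick the winning counter, so the first mover always loses.

```python
# In sequential rock-paper-scissors, the second player sees the first
# player's move and can always choose the move that beats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def second_player_move(first_move: str) -> str:
    """Return the counter that beats the revealed first move."""
    return BEATS[first_move]

# No matter what the first player reveals, the counter wins.
for move in BEATS:
    print(f"{move} is beaten by {second_player_move(move)}")
```

An LLM that "goes first" in such a game and expects a fair chance of winning has missed this simple strategic fact, which is the kind of common-sense gap the paragraph above describes.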
This project will focus on the specific question: can LLMs produce more functional information than is input into such systems via their training data, initialization, and human prompting? The work will likely be proof-focused with a supplemental empirical, simulation component.
Implications
The question of whether LLMs are true information-generation systems has significant implications for technology, society, and our understanding of artificial general intelligence. A negative answer to the research question could help us understand the phenomenon of “model collapse,” in which generative AI systems degrade when trained exclusively on their own outputs.

Such an answer would also inform the ethical discussion around data dignity and the impact of copyrighted material used to train LLMs. If LLMs are merely regurgitative interpolation systems, their outputs would not represent new content, but merely recombined source material. A better understanding of what LLMs actually produce, rather than repurpose, could therefore guide legal theory on whether LLM training constitutes fair use or copyright infringement. It could also open the way to fairer policies on data scraping and compensation for data used in AI training.

Lastly, if LLMs cannot create more information than they consume, this would be evidence against the argument that AI systems will eventually replace humans in all areas of intellectual work. There would be formal, significant differences between human and artificial capabilities that could not be overcome with more training. This would represent a hard limit for the question of general AI, one that would need to be addressed in all future discussions. It would affect our understanding of what it means to be intelligent, and whether consciousness can be fully mechanized in computational systems.
- We are usually a productive lab. You can be productive with us.
- Students become well-trained in academic research, having a high degree of project autonomy and responsibility.
- You can work on open-ended projects at the frontiers of current scientific knowledge. We're not afraid to push boundaries. You must be comfortable with uncertainty and have the grit to persist when the way forward isn't clear.
- We have daily meetings to keep you on track and hold our teams accountable. You can work together with great teammates!
- You can get articles about your research featured on the Mudd website:
https://www.hmc.edu/about/2025/01/17/harvey-mudd-student-kerria-pang-naylor-earns-prestigious-cra-outstanding-undergraduate-researcher-award/
https://magazine.hmc.edu/spring-2022/this-kind-of-fun/
https://www.hmc.edu/about/2024/01/11/cs-research-sets-boundaries-for-two-distribution-hypothesis-testing/
https://www.hmc.edu/about-hmc/2022/01/14/icaart-2022-highlights-hmc-cs-research/
https://www.hmc.edu/about-hmc/2021/04/29/harvey-mudd-cs-students-publish-and-present-work/
https://www.hmc.edu/about-hmc/2021/03/16/cs-research-published/
https://www.hmc.edu/about-hmc/2020/07/07/cs-clinic-team-publishes-transfer-learning-research/
https://www.hmc.edu/about-hmc/2020/03/05/cs-researchers-win-best-paper-at-icaart/
https://www.hmc.edu/about-hmc/2020/01/14/cs-researchers-present-findings-in-machine-learning-algorithms/
- We strongly believe in the importance of mentoring:
https://www.hmc.edu/about-hmc/2021/05/14/2021-harvey-mudd-college-leadership-awards/
https://www.hmc.edu/about-hmc/2020/05/12/montanez-receives-diversity-mentor-award/