Flexible, interdisciplinary computing for understanding the atmosphere

This independent study course (taken for credit as CHEM 150) will engage students in an active research program in atmospheric chemistry and climate science. The projects students will contribute to center on building flexible, interdisciplinary computational tools to improve our ability to simulate Earth’s atmosphere or to communicate findings from those simulations. The main project most students will join focuses on determining the applicability and limitations of modern machine learning approaches for simulating air quality. Additional ongoing group projects include simulating chemical reactions with finite state machines (cellular automata), improving scientific figures using saliency algorithms (computer vision), and identifying features of scientific papers that make them more likely to contribute to policy change (natural language processing). In all projects, students will learn fundamentals of atmospheric chemistry and gain familiarity with the computational techniques and data handling skills relevant to their specific research tasks.

Why take part in this independent study project? Through concentrated field campaigns, long-term air quality monitoring programs, and satellites, atmospheric scientists are collecting vast amounts of data that shape our understanding of the composition and chemistry of our atmosphere. This is especially important as we observe rapid, human-driven changes to that composition. In parallel, computational models are growing in complexity and requiring ever more resources to provide the high-resolution insights we need to make actionable simulations of air quality and climate change. Impactful research in this field relies not only on being able to process big data sets and produce robust computational simulations of a complex, nonlinear system (the atmosphere), but also on being able to make the insights from those data sets and simulations accessible. These research projects take on that large goal through a variety of different lenses.

What if our models could be more accurate and require less computing power?  

Project 1. Machine Learning for Air Quality Prediction: Applications of modern machine learning (ML) techniques have recently become an area of major interest in the air quality and climate community, but there are valid criticisms about the use of some of these approaches for gaining process-level information. This is especially true for deep learning methods that don’t intuitively provide insights into the chemical or physical mechanisms behind trends identified in the data. The current state of the art in air quality and Earth system modelling is highly mechanistic: large systems of coupled ordinary differential equations representing chemical reactions (air quality) are solved alongside discretized partial differential equations representing physical dynamics (weather). Unfortunately, these models are very computationally costly to run, requiring supercomputers or large clusters and significant runtimes (climate-length simulations can take years to run!). Computationally “cheap” ML models, once trained, can be run on more conventional computers in a fraction of the time, potentially making this area of research far more accessible. Before the use of these techniques becomes more widespread, there is an important question the community needs to answer: within the large pool of data already collected, are there datasets robust enough to train an ML model that could then make meaningful predictions about future states of the atmosphere, given the large nonlinearities in both the physical and chemical dynamics of the system? Further, do “explainable” ML techniques work well enough to give us confidence in those simulations by showing physically plausible correlations driving our predictions, or are spurious correlations giving us false confidence in our ML models?
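To make the contrast concrete, here is a minimal sketch of the mechanistic approach: a two-reaction NOx-ozone system integrated as coupled ODEs with SciPy. The photolysis frequency and rate constant below are illustrative placeholders, not values from any real chemical mechanism.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Simplified NOx-ozone mechanism (illustrative, assumed rate parameters):
#   NO2 + hv -> NO + O3      (photolysis frequency j)
#   NO + O3  -> NO2 + O2     (rate constant k)
j = 8.0e-3   # s^-1, assumed photolysis frequency
k = 4.0e-4   # ppb^-1 s^-1, assumed rate constant

def rhs(t, y):
    no2, no, o3 = y
    p = j * no2       # photolysis produces NO and O3
    l = k * no * o3   # titration converts them back to NO2
    return [-p + l, p - l, p - l]

# Integrate one hour from arbitrary initial mixing ratios (ppb)
sol = solve_ivp(rhs, (0, 3600), [20.0, 5.0, 30.0], rtol=1e-8)
no2, no, o3 = sol.y[:, -1]

# At photostationary state, [O3] ~ j[NO2]/(k[NO])
print(o3, j * no2 / (k * no))
```

Even this toy system has to be integrated with a stiff-capable solver; a full mechanism couples hundreds of such equations at every grid cell, which is where the computational cost comes from.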

All students working on this project will be:

  • Downloading air quality and meteorology datasets
  • Assessing the error and potential bias in said datasets
  • Gap-filling the training data
  • Training machine learning models (classic ML models and deep networks)
  • Performing model validation experiments
  • Creating informative plots showing model performance
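As a sketch of what that pipeline looks like end to end, the following uses synthetic stand-in data (not any of the real monitoring datasets) to gap-fill a record, train a random forest, and score it on held-out data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in for an hourly air quality record: ozone loosely driven
# by temperature and NOx, plus noise (real features would come from
# downloaded meteorology and monitoring data)
n = 2000
temp = 15 + 10 * np.sin(np.arange(n) * 2 * np.pi / 24) + rng.normal(0, 2, n)
nox = rng.gamma(2.0, 5.0, n)
ozone = 0.8 * temp - 0.3 * nox + rng.normal(0, 1.5, n)
df = pd.DataFrame({"temp": temp, "nox": nox, "ozone": ozone})

# Simulate sensor dropouts, then gap-fill by interpolation along the record
df.loc[rng.choice(n, 100, replace=False), "ozone"] = np.nan
df["ozone"] = df["ozone"].interpolate(limit_direction="both")

# Train on one split, validate on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    df[["temp", "nox"]], df["ozone"], test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"hold-out MAE: {mae:.2f} ppb")
```

A random split like this is only a starting point; for time series, validation on a held-out block of time is usually the more honest test.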

Continuing students will be:

  • Training models (random forest, 1D CNN, LSTM) for ozone prediction using global meteorology and emission inventory data as features
  • Further tuning loss functions for extreme event prediction
  • Finishing explainability experiments
  • Finishing data bias experiments
  • Finishing overfitting tests
  • Contributing to manuscript writing and editing

New projects can involve:

  • Investigating 2D model architectures (graph neural nets, 2D CNNs)
  • Investigating the potential for near real-time air quality prediction based on weather forecast (HRRR) data
  • Identifying potential features to include for aerosol prediction
  • Using LSTMs and/or fuzzy number models to fill data gaps
  • Alternative explainability metrics for deep networks
  • Physics-based loss functions for ozone prediction
  • Identifying potential adversarial data sets (what could someone do to intentionally build a misleading model and how could we tell if they were doing it?)               
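One simple explainability check in this spirit is permutation importance: shuffle one feature at a time and see how much model skill degrades. A sketch on synthetic data, where one feature truly drives the target and the other is pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 1500

# "temp" truly drives the target; "noise" is irrelevant. A physically
# plausible model should lose skill when temp is shuffled, not noise.
temp = rng.normal(20, 5, n)
noise = rng.normal(0, 1, n)
y = 2.0 * temp + rng.normal(0, 1, n)
X = np.column_stack([temp, noise])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(dict(zip(["temp", "noise"], result.importances_mean)))
```

A model whose importance concentrates on features with no plausible mechanism is exactly the spurious-correlation warning sign the project is looking for.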

Project 2. Cellular Automata for Atmospheric Chemistry: Traditional, deterministic air quality models can only include processes we can write down equations for and solve. Unfortunately, for many things we’d like to include, we either don’t know all the relevant equations or those equations aren’t compatible with our model scale. There are several processes that we know to be important for air quality – like the formation and growth of aerosols or chemical reactions on surfaces – for which this is true. Aerosol and surface chemistry are extremely complex, multi-phase systems that undergo nonlinear chemical reactions. These reactions occur on timescales ranging from fractions of a second to weeks. While no atmospheric models can currently incorporate all the relevant processes and scales needed to describe this system fully, the successful application of cellular automata (CA) to the modelling of other chemical systems warrants examination in the atmospheric context. CA are well suited to the prediction of emergent phenomena and can produce realistic simulations of complex reaction-diffusion systems with greater computational efficiency than traditional, differential-equation-based approaches. Can we find probabilistic rules, based on first-principles chemistry and physics, that can simulate these atmospheric emergent phenomena?

Students working on this project will be:

  • Programming automata to simulate simple (2-4 component), reasonably well-studied chemical systems:
    1. dissolving a salt in water
    2. acid-base neutralization
    3. drying a salt solution
    4. partitioning of a compound between two phases
    5. a simple diffusion-limited aggregation process
  • Identifying what known (theoretical and/or measured) parameters must be used to validate those simulations (rate constants, equilibrium constants, etc.)
  • (Attempting) to program automata to simulate atmospherically relevant multi-phase chemistry:
    1. new particle formation
    2. aerosol growth and aging
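A minimal example of the kind of automaton involved is diffusion-limited aggregation, one of the simple processes listed above: random walkers stick when they touch a growing cluster, a toy analogue of particle growth. The lattice size and walker budget below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Diffusion-limited aggregation on a small periodic 2D lattice
size = 31
grid = np.zeros((size, size), dtype=bool)
grid[size // 2, size // 2] = True  # seed particle

moves = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])

def touches_cluster(r, c):
    """True if any of the four lattice neighbours is occupied."""
    return any(grid[(r + dr) % size, (c + dc) % size] for dr, dc in moves)

for _ in range(40):  # release 40 walkers, one at a time
    r, c = rng.integers(0, size, 2)
    while grid[r, c]:  # start each walker from an unoccupied site
        r, c = rng.integers(0, size, 2)
    for _ in range(10000):  # random walk until it sticks (or give up)
        if touches_cluster(r, c):
            grid[r, c] = True
            break
        dr, dc = moves[rng.integers(4)]
        r, c = (r + dr) % size, (c + dc) % size

print("cluster size:", grid.sum())
```

The branching clusters this rule produces emerge from the sticking probability alone; the research question is whether analogous probabilistic rules can be grounded in real rate and equilibrium constants.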

What if we could improve how we communicate findings from our models to make our work more likely to have an impact on society?  

Project 3. Saliency for Scientific Figures (collaboration with Prof. Calden Wloka, HMC CS): Condensing complex atmospheric observations and model output into impactful scientific figures is a critical part of science communication and an essential step in presenting novel research to policy makers. Literature out of the climate psychology community has suggested that how humans view figures (the order we look at elements in a figure, how long we spend looking at individual components) can impact not only what information we gain from them, but also how we feel about that information. Human eye-tracking data is thought to provide helpful feedback on figure design, but this is largely impractical for most scientists due to the barriers posed by accurately gathering eye tracking data (you need costly, specialized equipment and a statistically robust sample of humans viewing your figures). It may be possible to approximate this feedback in an automated fashion, however. Within computer vision, the area of saliency modelling has focused on designing tools that can predict likely patterns of viewing by human observers. Unfortunately, saliency modelling has largely focused on predictions for natural scenes during free viewing (i.e., looking at real photographs without specific instructions), conditions that may not translate well to scientific figures. Will saliency models provide useful information when translated to the domain of scientific figures? Could we optimize models for scientific figures if we have data from humans to do it? Are there other tools we could use to gather human data that could be a low-cost proxy for eye-tracking data?

Students working on this project will be:

  • Assisting with survey design (we need to pair questions assessing viewers’ comprehension and level of concern with their figure-viewing data)
  • Assisting with human-subject eye-tracking experiments (collecting real human data)
  • Running saliency models
  • Assessing saliency model performance against human data from the eye-tracker and click-contingent software
  • Identifying possible ways to improve saliency models based on performance (transfer learning for deep networks?)
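One common way to score a saliency model against human data is the linear correlation coefficient (CC) between the predicted saliency map and a fixation density map. A sketch on synthetic maps (the "figure", the fixation pattern, and both predictions are all made up for illustration):

```python
import numpy as np

def saliency_cc(pred, fix_density):
    """Pearson correlation (CC) between a predicted saliency map and a
    human fixation density map, a standard saliency benchmark metric."""
    p = (pred - pred.mean()) / pred.std()
    f = (fix_density - fix_density.mean()) / fix_density.std()
    return (p * f).mean()

rng = np.random.default_rng(0)

# Toy fixation density: viewers concentrate on the upper-left panel
fix = np.zeros((64, 64))
fix[:32, :32] = 1.0
fix += rng.normal(0, 0.1, fix.shape)

good_pred = fix + rng.normal(0, 0.3, fix.shape)  # close to human pattern
bad_pred = rng.normal(0, 1.0, fix.shape)         # unrelated prediction

cc_good = saliency_cc(good_pred, fix)
cc_bad = saliency_cc(bad_pred, fix)
print(round(cc_good, 2), round(cc_bad, 2))
```

CC is only one of several metrics used in saliency benchmarking; rank-based scores like AUC are often reported alongside it because they weight sparse fixations differently.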

Project 4. Language Models for Policy Change (collaboration with Prof. Xanda Schofield, HMC CS): Environmental regulations in the United States are supposed to be based on the “best available science” as reviewed by the relevant regulatory agency (e.g. EPA). What makes a paper the “best”? If we want our work to matter, how should we present it to impact change? We are going to use tools from natural language processing to study the papers that influenced air quality regulation in the United States over the past 70 years to see what differentiates scientific papers cited by regulatory bodies from others in the same field and journals. Is it the topic of the paper? The novelty of the concept? Language usage? Something as simple as author prestige? Do the papers cited to motivate regulations represent what most scientists are working on or do the topics regulators target encourage more work in those areas? Can we identify features in papers that will make them more likely to impact policy?

Students working on this project will be:

  • Locating and downloading the EPA criteria reports that motivated the regulations over different stages of the Clean Air Act 
  • Identifying the paper citation style used in the reports and writing code to extract the citations from each document
  • Compiling a list of all the journals these papers were published in, the names of authors, and years of publication, to identify trends along with traditional publication metrics (Do the most cited journals change over time? Are they generally “high impact” journals? Are some authors frequently cited?)
  • Determining an appropriate sample size of papers to include to identify ‘typical’ papers in the representative journals and field
  • Determining if analysis should be run on full text of papers (maybe impossible!) or abstracts only
  • Using a diachronic analysis tool (like DRIFT) to track changes in topic, sentence complexity and readability, and the appearance of novel topics in both the ‘regulatory cited’ and ‘typical of field’ data sets
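As a sketch of the citation-extraction step, the snippet below assumes an author-year style like "(Smith et al., 1998)"; as the bullets note, the actual style used in the EPA criteria documents would need to be identified first.

```python
import re
from collections import Counter

# Assumed author-year citation pattern: "(Author, 1998)",
# "(Author et al., 1998)", or "(Author and Author, 1998)"
CITATION = re.compile(
    r"\(([A-Z][A-Za-z\-]+(?:\s+et al\.|\s+and\s+[A-Z][A-Za-z\-]+)?),\s*(\d{4})\)"
)

# Made-up report text, purely for illustration
text = (
    "Ozone formation is VOC-limited in many urban areas (Smith et al., 1998). "
    "Later work (Jones and Lee, 2005) refined these estimates, and "
    "(Smith et al., 1998) remains widely cited."
)

citations = CITATION.findall(text)          # list of (author, year) pairs
by_author = Counter(author for author, year in citations)
print(citations)
print(by_author.most_common(1))
```

Counts like these, aggregated across reports and matched to journals and publication years, feed directly into the trend questions above.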


Required Essay Prompt:

  1. Which project(s) are you interested in working on?
  2. Write two to three sentences on why you are interested in those specific projects (personal interest, alignment with future career goals, interest in the numerical techniques, etc.).
  3. What relevant experience have you had that will help you get started on your project? (This can be relevant course work, past research or professional experience, or anything else you think is relevant.)
Logistics Information:

Project categories: Chemistry, Computer Science, Computer Vision, Data Science, Earth Science, Environmental Science, Machine Learning, Natural Language Processing, Natural Resources and Conservation

Time commitment: Fall - Part Time (Academic Credit)

Number of openings: 10

Project director / principal investigator: Dr. Sarah Kavassalis, Postdoctoral Scholar in Interdisciplinary Computation