Quantum Many-Body Physics Calculations with Large Language Models
Authors
Haining Pan, Nayantara Mudur, Will Taranto, Maria Tikhanovskaya, Subhashini Venugopalan, Yasaman Bahri, Michael P. Brenner, Eun-Ah Kim
Abstract
Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method, requiring an analytic multi-step calculation deriving approximate Hamiltonian and corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases. The strong performance is the first step for developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
The Big Picture
Imagine hiring a research assistant who can sit down with a graduate-level physics textbook, follow a multi-page derivation filled with specialized mathematical symbols, and produce the correct answer on the first try. Now imagine that assistant is a language model trained mostly on internet text.
For decades, theoretical physicists have relied on Hartree-Fock mean-field theory, a method for calculating how electrons push and pull on each other inside exotic quantum materials. Mastering it takes years of graduate study. The calculation demands physical intuition, abstract algebra, specialized notation, and a multi-step logic where one wrong sign ruins everything downstream. It is exactly the kind of task that AI skeptics would call fundamentally human.
A team from Cornell, Google Research, and Harvard decided to test that assumption head-on, and GPT-4 did surprisingly well.
Key Insight: With carefully structured prompts and correction of intermediate steps, GPT-4 can execute graduate-level quantum physics derivations, scoring an average of 87.5 out of 100 on individual calculation steps and correctly deriving the final Hartree-Fock Hamiltonian for 13 of 15 real research papers.
How It Works
The team’s central insight was that the Hartree-Fock calculation, however intimidating, is actually structured. It follows a reproducible sequence of steps that any quantum physicist working on materials would recognize. So they built the HF template: a multi-step prompt framework that breaks the derivation into standardized stages, with placeholders for the problem-specific details of each paper.
Think of it like a tax form for quantum physics. The structure stays the same every time; you fill in the numbers that change. Each prompt step tells GPT-4 exactly what to calculate next, and the model supplies the physics-specific content for whatever system is under study.
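The placeholder idea can be sketched in a few lines of Python. This is a minimal illustration only: the template wording, the placeholder names (`interaction_symbol`, `flavors`, `expectation_symbol`), and the helper `fill_template` are all hypothetical and are not taken from the paper's actual prompts.

```python
from string import Template

# Hypothetical wording for a single derivation step. The real HF template
# in the paper is different; only the placeholder mechanism is illustrated.
STEP_TEMPLATE = Template(
    "You will apply Wick's theorem to the interaction term $interaction_symbol "
    "of a system with flavors $flavors. "
    "Keep only the normal expectation values $expectation_symbol."
)


def fill_template(template: Template, placeholders: dict[str, str]) -> str:
    """Fill every placeholder; raises KeyError if any value is missing."""
    return template.substitute(placeholders)


# Paper-specific values (made up here) that would change from system to system.
paper_specific = {
    "interaction_symbol": "H_int",
    "flavors": "spin and valley",
    "expectation_symbol": "<c^dagger c>",
}

prompt = fill_template(STEP_TEMPLATE, paper_specific)
print(prompt)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a forgotten placeholder fails loudly instead of leaving a stray `$name` in the prompt sent to the model.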

The derivation unfolds across five main stages:
- Establish the Hilbert space — identify the particle flavors (spin, orbital, valley, layer) and write the non-interacting Hamiltonian. The Hilbert space is the mathematical space of all possible quantum states; the Hamiltonian encodes the system’s total energy.
- Fourier transform — convert the Hamiltonian to momentum space. A Fourier transform re-expresses physical information in terms of particle momenta rather than positions, making subsequent steps cleaner.
- Apply Wick’s theorem — decompose the interaction Hamiltonian by replacing four-operator products with two-operator terms weighted by mean-field averages. In plain terms: the intractable problem of every electron interacting with every other electron is replaced by a simpler picture in which each electron moves in an average field created by the rest.
- Organize Hartree and Fock terms — sort the resulting energy equation into its two components: direct electron-electron repulsion (Hartree) and exchange interaction (Fock).
- Reveal symmetry structure — use the system’s symmetries to identify the order parameter, the quantity that measures how “ordered” the material is (the way magnetization measures how magnetic something is) and signals what phase it might enter.
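A worked form of the Wick's-theorem and Hartree/Fock stages may help make the list concrete. The expression below is the standard textbook mean-field decomposition of a four-fermion product, not an equation quoted from the paper. The labels 1 to 4 are generic stand-ins for combined momentum and flavor indices, and pairing (anomalous) averages are assumed to vanish.

```latex
\[
c_1^\dagger c_2^\dagger c_3 c_4 \;\approx\;
  \underbrace{\langle c_1^\dagger c_4 \rangle\, c_2^\dagger c_3
    + \langle c_2^\dagger c_3 \rangle\, c_1^\dagger c_4
    - \langle c_1^\dagger c_4 \rangle \langle c_2^\dagger c_3 \rangle}_{\text{Hartree}}
  \;-\;
  \underbrace{\Big( \langle c_1^\dagger c_3 \rangle\, c_2^\dagger c_4
    + \langle c_2^\dagger c_4 \rangle\, c_1^\dagger c_3
    - \langle c_1^\dagger c_3 \rangle \langle c_2^\dagger c_4 \rangle \Big)}_{\text{Fock}}
\]
```

This labeling assumes the interaction is ordered so that operators 1 and 4 carry the same flavor, and likewise 2 and 3. The first group then pairs each creation operator with its own partner and gives the direct (Hartree) term. The second group cross-pairs them, and its overall minus sign comes from exchanging two fermion operators.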

The researchers assembled a benchmark of 15 real physics papers from the past decade, spanning twisted bilayer graphene, topological insulators, Moiré systems, and more. For each paper, they filled in the templates with problem-specific information and asked GPT-4 to execute the derivation step by step. The model correctly produced the final Hartree-Fock Hamiltonian in 13 of 15 cases. The two failures involved minor errors, not catastrophic breakdowns.
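The step-by-step execution with correction of intermediate steps can be pictured as a simple loop. The sketch below is an assumption about how such a loop might be organized, not the authors' code: `ask_model` and `review` are hypothetical callables, and passing previously accepted answers back as context is one plausible design among several.

```python
from typing import Callable, Optional


def run_derivation(
    step_prompts: list[str],
    ask_model: Callable[[str], str],
    review: Callable[[int, str], Optional[str]],
) -> list[str]:
    """Run prompts in order, feeding accepted answers back as context.

    `ask_model` wraps whatever LLM is used (hypothetical stand-in).
    `review` returns a corrected answer, or None to accept the model's
    answer unchanged (hypothetical stand-in for the human check).
    """
    accepted: list[str] = []
    for index, prompt in enumerate(step_prompts):
        context = "\n\n".join(accepted)
        full_prompt = f"{context}\n\n{prompt}" if context else prompt
        answer = ask_model(full_prompt)
        correction = review(index, answer)
        # A corrected answer replaces the model's answer, so an error at
        # one step is not carried into the steps that follow.
        accepted.append(correction if correction is not None else answer)
    return accepted


# Toy stand-ins so the sketch runs without any external service.
def fake_model(prompt: str) -> str:
    return f"answer to: {prompt.splitlines()[-1]}"


def fake_review(index: int, answer: str) -> Optional[str]:
    return "corrected step 1" if index == 1 else None


results = run_derivation(["step A", "step B", "step C"], fake_model, fake_review)
print(results)
```

In this toy run the second step is overridden by the reviewer, and the third step sees the corrected text rather than the model's original answer.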
They also pushed the automation further. Rather than manually extracting information from each paper, they asked GPT-4 to read an abstract and answer ten targeted questions, filling in its own placeholders. The model performed well here too, correctly inferring notation and physical assumptions from a few sentences of text.
Why It Matters
Most AI benchmarks test whether models know facts about physics. This paper tests whether a model can do physics: executing a multi-step analytical calculation at the level expected of a first- or second-year PhD student. For this class of calculation, the answer is yes, provided intermediate steps are checked and corrected along the way.

The real payoff is about throughput. Hartree-Fock gets applied to hundreds of new quantum systems every year. Each application currently requires a skilled human to spend days on the derivation. An AI system that reliably automates that step doesn’t just save time. It opens the door to systematically scanning theoretical hypothesis spaces that would otherwise be inaccessible. New quantum materials are discovered faster than they can be theoretically characterized; reliable automation changes the math on that bottleneck.
Open questions remain. The study uses a human-in-the-loop correction scheme where intermediate errors get caught before they propagate. Fully autonomous derivation is harder. And Hartree-Fock, while ubiquitous, is one framework among many. Whether LLMs can handle more exotic methods (diagrammatic perturbation theory, renormalization group flows, tensor network contractions) is still an open challenge.
Bottom Line: GPT-4 can execute graduate-level quantum physics calculations with near-expert accuracy when given well-structured prompts. The analytic workhorse calculations of theoretical physics look ripe for AI-assisted automation, which could transform how quickly theorists explore new quantum materials.
IAIFI Research Highlights
This work sits squarely at the intersection of AI and quantum condensed matter theory. By showing that LLMs can reliably execute the Hartree-Fock derivation at graduate-student level, it opens a new direction for AI-assisted theoretical research.
The paper introduces a structured prompt-template methodology for multi-step analytical reasoning, showing that task decomposition and placeholder-based prompting dramatically improve LLM performance on complex, symbol-heavy scientific calculations.
Automating Hartree-Fock calculations across 15 diverse quantum materials systems creates a path toward large-scale computational exploration of symmetry-breaking phases and emergent quantum order in correlated electron systems.
Future work targets other calculational frameworks and fully autonomous derivation without intermediate human correction. The paper is available at [arXiv:2403.03154](https://arxiv.org/abs/2403.03154) and represents an early step toward AI systems that can generate and test theoretical physics hypotheses on their own.
Original Paper Details
Quantum Many-Body Physics Calculations with Large Language Models
[2403.03154](https://arxiv.org/abs/2403.03154)