Modeling Galaxy Surveys with Hybrid SBI

The Big Picture

Imagine trying to understand a city by looking at it from two different altitudes at once. A satellite view captures neighborhoods and highways. A street-level view reveals individual storefronts and alleyways. Each perspective shows things the other misses. Now imagine the street-level view is extraordinarily expensive to produce, so expensive that you can only afford to photograph a few city blocks. That’s roughly the challenge facing cosmologists trying to map the universe using galaxy surveys.

Surveys like DESI and Euclid, soon to be joined by the Rubin Observatory, are producing galaxy maps of staggering scale, cataloging hundreds of millions of objects spread across cosmic history. But squeezing every drop of information out of those maps requires modeling physics at wildly different scales simultaneously. On large scales, the universe is relatively tame: galaxies cluster predictably, and established analytic formulas describe that clustering accurately. On small scales, gravity has done its work in earnest, matter has collapsed into dense clumps, and the physics becomes so tangled that only expensive computer simulations can track it.

A new paper from researchers at Harvard, NYU, Columbia, and Stanford lays out a practical roadmap for combining the best of both worlds into a single framework called Hybrid Simulation-Based Inference (HySBI), now extended for the first time to the galaxy clustering data that real surveys actually produce.

Key Insight: By splitting the analysis into a perturbation-theory treatment of large scales and a machine-learning-powered simulation analysis of small scales, HySBI extracts 60% tighter constraints on a key measure of cosmic structure, without needing the prohibitively expensive simulations that a pure ML approach would demand.

How It Works

Don’t use the same tool for every scale. The researchers carve the observable universe into two regimes and apply the best available method to each.

On large scales, they deploy the Effective Field Theory of Large-Scale Structure (EFT-LSS), a well-tested analytic framework for computing galaxy clustering on scales where gravitational collapse is still mild. It yields a theoretical prediction for the large-scale galaxy power spectrum (how much clustering occurs at each spatial frequency), parametrized by bias parameters that capture how galaxies trace the underlying dark matter distribution.
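To make the role of a bias parameter concrete, here is a minimal sketch of the leading-order relation between the galaxy and matter power spectra: a linear bias b1 rescales the matter clustering, and a constant shot-noise term 1/nbar is added. This is only the lowest-order piece of such a model; the full EFT-LSS prediction includes loop corrections, higher-order bias terms, and counterterms that are not shown. The function names, the toy matter spectrum, and all numerical values are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np

def toy_matter_power(k, amplitude=2.0e4, k_turn=0.02):
    """Hypothetical smooth stand-in for a matter power spectrum.

    Rises linearly at low k and falls as k**-1.5 at high k. A real analysis
    would take this from a Boltzmann solver, not from a fitting shape.
    """
    x = k / k_turn
    return amplitude * x / (1.0 + x ** 2.5)

def galaxy_power_leading_order(k, b1, nbar):
    """Leading-order galaxy power spectrum: b1^2 * P_matter(k) + shot noise."""
    return b1 ** 2 * toy_matter_power(k) + 1.0 / nbar

# Illustrative wavenumbers and galaxy number density (hypothetical values).
k = np.linspace(0.01, 0.1, 10)
nbar = 3.0e-4
shot_noise = 1.0 / nbar

p_b1 = galaxy_power_leading_order(k, b1=1.0, nbar=nbar)
p_b2 = galaxy_power_leading_order(k, b1=2.0, nbar=nbar)
```

Doubling b1 quadruples the clustering part of the spectrum while leaving the shot noise unchanged, which is why the bias parameters and the clustering amplitude have to be constrained together.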

On small scales, the team uses simulation-based inference (SBI), a machine learning technique built on neural density estimation. SBI works by:

Running a suite of smaller “subbox” simulations (compact patches of simulated universe) at many different cosmological parameter settings
Training a neural network to learn the relationship between summary statistics computed from those simulations and the underlying parameters
Using that trained network to evaluate how likely any given set of observations is, given a particular cosmology
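The three steps above can be sketched with a toy problem. In this illustration the "simulation" is a cheap hypothetical function of a single parameter, and the neural density estimator is replaced by a linear-Gaussian fit (least squares for the mean, residual covariance for the scatter). That swap is made purely to keep the example small and self-contained; it is not the method used in the paper, and every name and number here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_summary(theta, rng):
    """Hypothetical stand-in for a subbox simulation plus summary statistic."""
    mean = np.array([2.0 * theta, 1.0 - theta])
    return mean + 0.05 * rng.normal(size=2)

# Step 1: run the simulator at many parameter settings drawn from a prior.
thetas = rng.uniform(0.0, 1.0, size=2000)
summaries = np.array([simulate_summary(t, rng) for t in thetas])

# Step 2: fit a conditional density p(summary | theta).
# A linear-Gaussian model stands in for the neural network here.
design = np.column_stack([thetas, np.ones_like(thetas)])
coeffs, *_ = np.linalg.lstsq(design, summaries, rcond=None)
residuals = summaries - design @ coeffs
cov = np.cov(residuals, rowvar=False)
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)

def log_likelihood(x_obs, theta):
    """Learned log p(x_obs | theta) under the fitted Gaussian model."""
    mean = np.array([theta, 1.0]) @ coeffs
    diff = x_obs - mean
    norm = logdet + len(x_obs) * np.log(2.0 * np.pi)
    return -0.5 * (diff @ cov_inv @ diff + norm)

# Step 3: evaluate the learned likelihood of an "observation" on a grid.
theta_true = 0.3
x_obs = simulate_summary(theta_true, rng)
grid = np.linspace(0.0, 1.0, 501)
loglike = np.array([log_likelihood(x_obs, t) for t in grid])
theta_best = grid[np.argmax(loglike)]
```

The learned likelihood peaks close to the parameter value used to generate the mock observation. In a realistic setting the summaries depend nonlinearly on the parameters and have non-Gaussian scatter, which is the reason a flexible neural density estimator is used instead of a fit like this one.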

The small-scale simulations use a Halo Occupation Distribution (HOD) model to populate dark matter halos with galaxies. Halos are the invisible knots of gravitational scaffolding where galaxies tend to form, and the HOD model introduces nuisance parameters that must be constrained alongside the cosmology.
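As an illustration of what an HOD looks like in practice, the sketch below uses a five-parameter form that is common in the literature: a smoothed step function for central galaxies and a power law for satellites. Whether the paper uses exactly this parametrization is an assumption, and the default parameter values and function names are hypothetical.

```python
import math
import numpy as np

def mean_centrals(log_mass, log_m_min=13.0, sigma_log_m=0.5):
    """Mean number of central galaxies (between 0 and 1) for a halo of log10 mass."""
    return 0.5 * (1.0 + math.erf((log_mass - log_m_min) / sigma_log_m))

def mean_satellites(log_mass, log_m0=13.0, log_m1=14.0, alpha=1.0):
    """Mean number of satellite galaxies: a power law above a cutoff mass."""
    mass, m0, m1 = 10.0 ** log_mass, 10.0 ** log_m0, 10.0 ** log_m1
    if mass <= m0:
        return 0.0
    return ((mass - m0) / m1) ** alpha

def populate_halos(log_masses, rng):
    """Draw central (Bernoulli) and satellite (Poisson) counts for each halo."""
    p_cen = np.array([mean_centrals(m) for m in log_masses])
    lam_sat = np.array([mean_satellites(m) for m in log_masses])
    n_cen = (rng.random(len(log_masses)) < p_cen).astype(int)
    n_sat = rng.poisson(lam_sat)
    return n_cen, n_sat

# Hypothetical halo masses (log10), from small groups to a massive cluster.
rng = np.random.default_rng(1)
log_masses = np.array([12.0, 13.0, 14.0, 15.0])
n_cen, n_sat = populate_halos(log_masses, rng)
```

The five numbers log_m_min, sigma_log_m, log_m0, log_m1, and alpha play the role of the nuisance parameters described above: they change how many galaxies each halo hosts without changing the cosmology.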

What makes this more than a simple divide-and-conquer is how the two halves communicate. The parameters governing galaxies on large scales (EFT bias parameters) and small scales (HOD parameters) are not independent; both respond to the same underlying galaxy physics. The team shows that simulation-informed priors can constrain the EFT bias parameters, letting the large-scale analysis benefit from physical insights extracted from small-scale simulations. The combined posterior comes from multiplying the large-scale analytic likelihood with the small-scale neural likelihood, producing a coherent picture of what the full galaxy map implies about the universe.
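Multiplying two likelihoods is the same as adding their logarithms, and the effect on the constraint can be seen in a one-parameter toy. The sketch below treats both likelihoods as Gaussians in σ_8 on a grid and compares the posterior width with and without the small-scale term. The centers and widths are invented for illustration and are not results from the paper; the small-scale term is also treated as independent of the large-scale data, which is a simplifying assumption.

```python
import numpy as np

sigma8_grid = np.linspace(0.6, 1.0, 801)

def gaussian_loglike(theta, center, width):
    """Unnormalized Gaussian log-likelihood, a stand-in for either half."""
    return -0.5 * ((theta - center) / width) ** 2

# Flat prior plus two hypothetical likelihoods (illustrative numbers only).
log_prior = np.zeros_like(sigma8_grid)
log_like_large = gaussian_loglike(sigma8_grid, center=0.82, width=0.05)  # analytic
log_like_small = gaussian_loglike(sigma8_grid, center=0.80, width=0.03)  # learned

def posterior_std(log_post, grid):
    """Standard deviation of a gridded, unnormalized log-posterior."""
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    mean = np.sum(weights * grid)
    return np.sqrt(np.sum(weights * (grid - mean) ** 2))

std_large_only = posterior_std(log_prior + log_like_large, sigma8_grid)
std_combined = posterior_std(
    log_prior + log_like_large + log_like_small, sigma8_grid
)
```

For two Gaussians the combined width follows the inverse-variance rule, 1 / sqrt(1/0.05^2 + 1/0.03^2), or roughly 0.026, narrower than either term alone. The real analysis works in a higher-dimensional space that also includes the bias and HOD parameters, so the gain there depends on how well those nuisance parameters are pinned down.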

Why It Matters

When the team runs HySBI on mock galaxy observations designed to mimic a DESI-like survey, the improvement over traditional perturbation theory alone is clear. HySBI achieves 20% tighter constraints on Ω_m (the fraction of the universe’s energy budget made up of matter) and 60% tighter constraints on σ_8, a measure of how clumped matter is within spheres of radius 8 h^-1 Mpc. That’s not a modest improvement. For some of the most contested questions in fundamental physics, gains of that size can be the difference between ambiguity and resolution.

The gain on σ_8 matters especially. The so-called “S8 tension,” a persistent discrepancy between the amplitude of matter clustering measured by galaxy surveys and the value inferred from the cosmic microwave background, is one of the most active fault lines in cosmology today. S8 is defined as σ_8 √(Ω_m/0.3), so it depends directly on both parameters that HySBI tightens. Tools that can pin down σ_8 with greater precision are exactly what the field needs.

The timing is no accident. DESI is already collecting data. Euclid is in orbit. The Rubin Observatory will soon produce the deepest and widest optical survey ever made. These experiments will generate datasets so large that analysis sophistication becomes the bottleneck, not data quality.

Pure SBI approaches require simulations that match the full survey volume, which grows computationally impractical as surveys scale up. HySBI sidesteps this by confining expensive simulations to small subvolumes and reserving full-volume coverage for analytic perturbation theory. That makes it scalable in a way pure simulation approaches are not.
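A rough sense of the savings comes from simple volume scaling. At fixed mass resolution, the number of particles in an N-body simulation grows with the cube of the box side, and the cost grows at least as fast. The box sizes below are hypothetical round numbers chosen only to illustrate the arithmetic; they are not the sizes used in the paper.

```python
def relative_particle_count(full_side, subbox_side):
    """Ratio of particle counts for two cubic boxes at equal mass resolution."""
    return (full_side / subbox_side) ** 3

# Hypothetical box sides in Mpc/h: a survey-sized box versus a subbox.
ratio = relative_particle_count(full_side=2000.0, subbox_side=500.0)
```

With these illustrative numbers, each survey-sized simulation needs 64 times as many particles as one subbox, and a training set requires many simulations across parameter space, so the difference compounds.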

The extension to galaxy clustering, rather than the dark matter fields studied in earlier HySBI work, is a necessary step toward real-world use. Galaxies are what telescopes actually observe, and every galaxy catalog carries imprints of astrophysical processes that must be modeled as nuisance parameters rather than ignored. This paper shows HySBI handles that complexity without sacrificing the cosmological signal.

Bottom Line: HySBI for galaxy clustering combines the interpretability of analytic theory with the raw power of machine learning, delivering tighter constraints on fundamental cosmological parameters at a computational cost that scales to the next generation of survey data.

IAIFI Research Highlights

Interdisciplinary Research Achievement: This work connects modern machine learning (neural density estimation for simulation-based inference) with analytical physics (effective field theory of large-scale structure), showing that the two approaches produce stronger results together than either does alone.

Impact on Artificial Intelligence: The paper advances simulation-based inference methodology by showing how a neural likelihood learned from simulations can be combined with an analytic likelihood, and how physically motivated priors from simulations can transfer across modeling frameworks.

Impact on Fundamental Interactions: HySBI delivers 20% and 60% tighter constraints on Ω_m and σ_8, respectively, offering a scalable path to resolving ongoing cosmological tensions using data from next-generation surveys like DESI and Euclid.

Outlook and References: The next steps point toward applying HySBI to real survey data and incorporating more sophisticated small-scale summary statistics; the paper is available at arXiv:2505.13591 (Zhang, Modi & Philcox, 2025).
