Improved Distribution Matching Distillation for Fast Image Synthesis
Authors
Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman
Abstract
Recent approaches have shown promise in distilling diffusion models into efficient one-step generators. Among them, Distribution Matching Distillation (DMD) produces one-step generators that match their teacher in distribution, without enforcing a one-to-one correspondence with the sampling trajectories of their teachers. However, to ensure stable training, DMD requires an additional regression loss computed using a large set of noise-image pairs generated by the teacher with many steps of a deterministic sampler. This is costly for large-scale text-to-image synthesis and limits the student's quality, tying it too closely to the teacher's original sampling paths. We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. We show that the resulting instability is due to the fake critic not estimating the distribution of generated samples accurately and propose a two time-scale update rule as a remedy. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images. This lets us train the student model on real data, mitigating the imperfect real score estimation from the teacher model, and enhancing quality. Lastly, we modify the training procedure to enable multi-step sampling. We identify and address the training-inference input mismatch problem in this setting, by simulating inference-time generator samples during training time. Taken together, our improvements set new benchmarks in one-step image generation, with FID scores of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the original teacher despite a 500X reduction in inference cost. Further, we show our approach can generate megapixel images by distilling SDXL, demonstrating exceptional visual quality among few-step methods.
Concepts
The Big Picture
Imagine teaching a student by having them watch an expert work through every single step of a complex problem, thousands of steps, repeated millions of times. Now imagine that student could distill all of that watching into a single, instantaneous answer that’s actually better than their teacher. That’s what a new technique called DMD2 accomplishes for AI image generation.
Diffusion models, the engine behind tools like Stable Diffusion and DALL-E, create images by starting with random noise and progressively refining it over dozens of cleanup steps. Each step means running the full AI model from scratch, which at high resolutions makes image generation slow and expensive. Researchers have been racing to compress these lumbering models into nimble one-step generators.
A team from MIT and Adobe Research has now solved this problem more cleanly than anyone before, achieving image quality that surpasses the original teacher model while cutting inference cost by 500x.
Key Insight: DMD2 trains a fast image generator to match the output of a slow, high-quality diffusion model. It augments this with adversarial training against real photos and a two-speed training schedule, eliminating the need for expensive pre-generated datasets and producing one-step generators that actually outperform the slow models they learned from.
How It Works
The original Distribution Matching Distillation (DMD) had a clever core idea: instead of teaching a student to mimic the teacher’s exact step-by-step path from noise to image, train it to match the teacher’s output distribution, the statistical profile of what good images look like. Think of it as learning to paint beautiful sunsets without memorizing a master’s exact brushstrokes, just the general character of what makes a sunset work.
But DMD had a weakness. To keep training stable, it required a regression loss computed from millions of pre-generated noise-image pairs, which forced the student to memorize specific teacher trajectories. Generating these pairs is expensive, and worse, it caps the student’s quality: it can never exceed what the teacher produces along those fixed paths.

DMD2 removes this requirement through three linked improvements:
Two Time-Scale Update Rule. Without the regression loss, training becomes unstable. The “fake critic” network (a separate model that monitors where the student tends to generate images) falls behind and gives inaccurate feedback. The fix: update the critic more frequently than the generator. Letting the critic catch up before each generator update keeps the feedback signal accurate, and training stabilizes without any pre-computed data.
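The schedule can be sketched as a toy training loop. The counters below are stand-ins for real gradient steps, and the 5:1 ratio is illustrative of the order the paper operates at; in DMD2 both the generator and the fake critic are large diffusion-style networks.

```python
# Toy illustration of the two time-scale update rule: the fake critic
# takes several update steps for every single generator step, so its
# estimate of the generated distribution stays current.

CRITIC_STEPS_PER_GEN_STEP = 5  # illustrative ratio

def train(num_generator_steps):
    critic_updates = 0
    generator_updates = 0
    for _ in range(num_generator_steps):
        # The critic catches up to the generator's current outputs first...
        for _ in range(CRITIC_STEPS_PER_GEN_STEP):
            critic_updates += 1      # stand-in for one critic gradient step
        # ...then the generator takes one step with accurate feedback.
        generator_updates += 1       # stand-in for one generator gradient step
    return critic_updates, generator_updates
```

The key design choice is ordering: the critic always moves before the generator does, so the feedback the generator receives reflects its current, not stale, output distribution.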
GAN Loss on Real Images. The researchers added an adversarial objective borrowed from Generative Adversarial Networks (GANs), where a discriminator learns to tell fake images from real ones. The teacher model’s understanding of real data is inherently an imperfect approximation, but real images are ground truth. Training against actual photographs lets the student correct for the teacher’s errors and produce sharper results.
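The adversarial objective is the standard non-saturating GAN loss; a minimal numpy sketch follows (function names are hypothetical, and in the paper the discriminator shares features with the fake critic rather than being a standalone network).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(logits_real, logits_fake):
    # Discriminator objective: push logits on real images up,
    # logits on generated images down.
    return -np.mean(np.log(sigmoid(logits_real) + 1e-8)
                    + np.log(1.0 - sigmoid(logits_fake) + 1e-8))

def generator_gan_loss(logits_fake):
    # Non-saturating generator objective: make fakes look real
    # to the discriminator.
    return -np.mean(np.log(sigmoid(logits_fake) + 1e-8))
```

Because the discriminator is trained on actual photographs, this loss injects ground-truth information that the teacher's learned score can only approximate.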
Backward Simulation for Multi-Step Sampling. The original DMD only supported one-step generation. DMD2 extends this to 2–4 steps, but doing so naively creates a training-inference mismatch: during training, the model sees inputs from one distribution, while during inference it sees inputs from a slightly different one. The solution is to run the generator itself to produce the noisy intermediate samples used during training. The model then practices on exactly the kind of inputs it will encounter at test time.

Under the hood, DMD2 minimizes an approximate KL divergence between the distribution of the student generator's outputs and the real image distribution as modeled by the teacher. The gradient of this divergence decomposes into the difference of two score functions: one pointing toward the real image distribution, one characterizing where the student currently generates. The GAN loss sharpens the first signal with actual ground-truth photos, while the two time-scale rule keeps the second signal accurate.
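The score-difference structure of the gradient can be illustrated on a toy problem where both distributions are 1-D Gaussians, so the scores are known in closed form (in DMD2 both scores come from diffusion networks; the means and variance below are arbitrary for illustration).

```python
import numpy as np

def score_gaussian(x, mu, sigma):
    # Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2
    return -(x - mu) / sigma**2

def dmd_update_direction(x, mu_real=0.0, mu_fake=2.0, sigma=1.0):
    s_real = score_gaussian(x, mu_real, sigma)  # "where real data is"
    s_fake = score_gaussian(x, mu_fake, sigma)  # "where the student is"
    # Moving a sample along (s_real - s_fake) pulls the student's
    # distribution toward the real one.
    return s_real - s_fake
```

With the fake distribution centered at 2 and the real one at 0, the update direction is uniformly negative: every sample is pushed toward the real mean, which is exactly the behavior the distillation loss induces in the generator.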
Why It Matters
DMD2 achieves an FID score of 1.28 on ImageNet-64×64 in one step. FID measures how close generated images are to real ones (lower is better). The original teacher diffusion model scores around 2.0 on the same benchmark. The student beats the teacher with 500x less compute.
On zero-shot COCO 2014, the standard text-to-image benchmark, DMD2 scores 8.35, again outperforming its teacher.
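For intuition about what FID computes, here is its definition reduced to one dimension. Real FID fits Gaussians to Inception-network features of the two image sets and uses a matrix square root of the covariances; this scalar version is a hypothetical stand-in that fits 1-D Gaussians directly to two sample sets.

```python
import numpy as np

def fid_1d(samples_a, samples_b):
    mu_a, mu_b = np.mean(samples_a), np.mean(samples_b)
    var_a, var_b = np.var(samples_a), np.var(samples_b)
    # Frechet distance between N(mu_a, var_a) and N(mu_b, var_b):
    # squared mean gap plus a covariance mismatch term.
    return (mu_a - mu_b) ** 2 + var_a + var_b - 2.0 * np.sqrt(var_a * var_b)
```

Identical sample sets score 0, and the score grows as the means or spreads of the two distributions drift apart, which is why lower FID indicates generated images statistically closer to real ones.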

The team also distilled SDXL, one of the most capable open text-to-image models, showing that DMD2 scales to megapixel resolution. The 4-step SDXL distillation produces images with strong visual quality, competitive with or exceeding methods that require far more computation.
A 500x speedup doesn’t just make things faster. It makes applications feasible that were previously too expensive to consider, from real-time creative tools to large-scale synthetic dataset generation.
The broader lesson here is that matching the output of a complex system is often more tractable than mimicking its process. This principle could extend to video synthesis, 3D generation, and scientific simulation, anywhere a slow, high-quality model needs a fast approximation.
Bottom Line: DMD2 proves that one-step image generators don't have to compromise quality. By combining two time-scale training, adversarial objectives on real data, and backward simulation, the student model exceeds its teacher at 1/500th the inference cost.
IAIFI Research Highlights
This work connects generative AI with theoretical ideas from optimal transport and divergence minimization, applying rigorous mathematical frameworks to the practical problem of making large-scale AI models computationally tractable.
DMD2 sets new records for one-step and few-step image generation, showing that distribution-matching distillation, augmented with adversarial training and two-timescale updates, can surpass teacher diffusion models at 500x lower inference cost.
Fast, high-quality generative models accelerate scientific workflows in physics and astronomy, where simulation-based inference and synthetic data generation are increasingly important tools for interpreting complex observational datasets.
Future work may extend DMD2's backward simulation and GAN-augmented distillation to video generation and 3D synthesis. The full approach is described in [arXiv:2405.14867](https://arxiv.org/abs/2405.14867).