Memorization to Generalization
Emergence of Diffusion Models from Associative Memory

RPI
JADS Research
Matteo Negri
Sapienza University of Rome
Radboud University
Dmitry Krotov
MIT-IBM Watson AI Lab
Figure 1. Increasing empirical support changes the attractor structure of the denoising energy. Isolated minima retrieve stored samples, interacting memories produce spurious states, and dense support gives a broader low-energy structure aligned with the data geometry.

Our paper studies diffusion models as associative memories. In the optimal empirical denoising problem, the finite training set induces an energy landscape with a memory-like retrieval structure. With sparse empirical support, the stable states are individual training examples. As support increases, memory basins interfere and spurious states appear. These generated attractors are absent from the training set, and their emergence marks the boundary where sample-level retrieval gives way to the onset of generalization.

The finite-data problem

Diffusion models are commonly described as models of a data distribution. We study the finite-data version of this statement. Given a training set, what are the stable states of the generative dynamics induced by optimal denoising?

The question is an attractor question. A generated sample is not only an output of a learned sampler; it is a point reached by the denoising dynamics. If the stable states coincide with training examples, generation is sample-level retrieval. If stable states appear away from the training set, the model has left the pure memorization regime.

Question. How does the attractor structure of the empirical denoising energy change as the effective memory load increases?

Associative memory

Associative memories store patterns as minima of an energy function. Retrieval is motion toward lower energy. When the stored patterns are isolated, retrieval returns memories. When stored patterns interfere, the same energy can develop additional minima that were never stored.

Dense Associative Memories express this retrieval structure through a softmax competition between stored patterns:

$$ E_{\mathrm{DAM}}(x) = \frac{1}{2}\lVert x\rVert^2 - \frac{1}{\beta} \log \sum_{i=1}^{N} \exp\left(\beta x^\top y_i\right). $$

Here \(y_i\) are stored patterns and \(\beta\) controls retrieval selectivity. At high selectivity, one stored pattern dominates. At intermediate memory load, several stored patterns can shape the dynamics together. The resulting landscape may contain spurious minima: stable retrieval endpoints that do not correspond to stored patterns.
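A minimal NumPy sketch of this energy and of one way to descend it. The fixed-point update \(x \leftarrow \sum_i \mathrm{softmax}(\beta Y x)_i\, y_i\) is an assumed choice of dynamics (the text only requires motion toward lower energy); its fixed points are the stationary points of \(E_{\mathrm{DAM}}\). The function names and the two toy patterns are illustrative. Note that the toy example induces interference by lowering \(\beta\) rather than by adding patterns.

```python
import numpy as np

def dam_energy(x, patterns, beta):
    """E_DAM(x) = 0.5*||x||^2 - (1/beta) * log sum_i exp(beta * x . y_i)."""
    logits = beta * (patterns @ x)
    m = logits.max()  # subtract the max for a numerically stable log-sum-exp
    lse = m + np.log(np.exp(logits - m).sum())
    return 0.5 * (x @ x) - lse / beta

def dam_update(x, patterns, beta):
    """One fixed-point step: x <- softmax-weighted average of the stored patterns."""
    logits = beta * (patterns @ x)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ patterns

def retrieve(x, patterns, beta, steps=50):
    """Iterate the update (assumed dynamics) for a fixed number of steps."""
    for _ in range(steps):
        x = dam_update(x, patterns, beta)
    return x

# Toy example (assumed values): two orthogonal stored patterns and a noisy query.
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([0.9, 0.1])

sharp = retrieve(query, Y, beta=20.0)   # high selectivity: ends close to the stored pattern [1, 0]
blurred = retrieve(query, Y, beta=1.0)  # low selectivity: ends close to the midpoint [0.5, 0.5]
```

In this toy case the low-selectivity endpoint is a minimum of the energy that is not one of the stored patterns, which is the simplest instance of the interference described above.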

The empirical denoising energy

The diffusion connection appears in the optimal denoising problem for a finite empirical distribution. Let the training set be \(\{y_i\}_{i=1}^{N}\). Under Gaussian corruption, the noisy-data density is a Gaussian mixture:

$$ p_\sigma(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}\left(x;\, a_\sigma y_i,\, \sigma^2 I\right). $$

The corresponding energy is \(E_\sigma(x) = -\log p_\sigma(x)\). Expanding the mixture gives

$$ E_\sigma(x) = \frac{\lVert x\rVert^2}{2\sigma^2} - \log \sum_{i=1}^{N} \exp\left( \frac{a_\sigma}{\sigma^2}x^\top y_i - \frac{a_\sigma^2}{2\sigma^2}\lVert y_i\rVert^2 \right) + C_\sigma . $$

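Up to an overall scale, additive constants, and norm-dependent corrections, this is the same log-sum-exp retrieval structure as the Dense Associative Memory energy. For stored patterns with comparable norms, the term \(\frac{a_\sigma^2}{2\sigma^2}\lVert y_i\rVert^2\) is nearly the same for every \(i\) and only shifts the energy by a constant, so the softmax competition is governed by the effective inverse temperature \(\beta_\sigma = a_\sigma / \sigma^2\). The optimal empirical denoising energy therefore acts as a softmax memory over the training examples.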
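The expansion can be checked numerically. The sketch below evaluates \(-\log p_\sigma(x)\) directly from the Gaussian mixture and compares it with the expanded form. It assumes \(C_\sigma = \log N + \frac{d}{2}\log(2\pi\sigma^2)\), which follows from the Gaussian normalization, with \(d\) the data dimension (a symbol introduced here). The function names and toy values are illustrative.

```python
import numpy as np

def logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def neg_log_mixture(x, Y, a, sigma):
    """-log p_sigma(x), computed directly from the Gaussian mixture."""
    n, d = Y.shape
    sq_dist = np.sum((x - a * Y) ** 2, axis=1)
    log_comp = -sq_dist / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    return -(logsumexp(log_comp) - np.log(n))

def expanded_energy(x, Y, a, sigma):
    """Expanded form of E_sigma(x), without the constant C_sigma."""
    logits = (a / sigma**2) * (Y @ x) - (a**2 / (2 * sigma**2)) * np.sum(Y**2, axis=1)
    return (x @ x) / (2 * sigma**2) - logsumexp(logits)

def c_sigma(n, d, sigma):
    """Assumed constant: log N + (d/2) * log(2*pi*sigma^2)."""
    return np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma**2)

# Toy check with random data (assumed values for a and sigma).
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))
x = rng.normal(size=3)
direct = neg_log_mixture(x, Y, a=0.8, sigma=0.6)
expanded = expanded_energy(x, Y, a=0.8, sigma=0.6) + c_sigma(5, 3, 0.6)
```

The two values agree to floating-point precision, so the log-sum-exp form carries the full dependence on \(x\).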

Scope. This is a statement about the optimal empirical denoising energy. It does not claim that every trained neural network exactly realizes the analytic energy. It identifies the memory structure of the finite-data denoising problem.

Attractor regimes

Increasing the amount of empirical support changes the organization of local minima. Sparse support gives isolated sample-level basins. Intermediate support creates interference between basins. Dense support produces a broader low-energy structure aligned with the data geometry.

Two-sample memorization regime

Sample-level retrieval

With few examples, individual samples define stable basins. Generation retrieves stored data.

Intermediate spurious-state regime

Spurious retrieval

At the transition, memory basins interfere. New stable states appear away from the training samples.

Large-data generalization regime

Distribution-level structure

With denser support, the low-energy region is no longer organized only around individual observations.

The memorization-to-generalization transition is therefore an attractor transition. Memorization corresponds to local minima at stored examples. Spurious generation corresponds to local minima created by interference between stored examples. Generalization begins when sample-level retrieval ceases to dominate the generated distribution.
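The two ends of this picture can be illustrated on a toy problem. The sketch below places training samples evenly on a unit circle, starts a descent at each sample, and measures how far the endpoint lies from the nearest training sample. Everything here is an assumption made for illustration: the ring data, the values \(a_\sigma = 1\) and \(\sigma = 0.3\), the helper names, and the fixed-point update \(x \leftarrow a_\sigma \sum_i w_i(x)\, y_i\), whose fixed points are stationary points of \(E_\sigma\). It shows the sparse and dense ends only and is not a reproduction of the spurious-state regime.

```python
import numpy as np

def ring(n):
    """n training samples spaced evenly on the unit circle (toy data)."""
    t = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(t), np.sin(t)], axis=1)

def descend(x, Y, a, sigma, steps=200):
    """Fixed-point iteration x <- a * sum_i w_i(x) y_i with softmax weights w_i."""
    bias = -(a**2) * np.sum(Y**2, axis=1) / (2 * sigma**2)
    for _ in range(steps):
        logits = (a / sigma**2) * (Y @ x) + bias
        w = np.exp(logits - logits.max())
        w /= w.sum()
        x = a * (w @ Y)
    return x

def endpoint_gap(Y, a=1.0, sigma=0.3):
    """Largest distance from an endpoint (one descent per sample) to its nearest sample."""
    ends = np.array([descend(y.copy(), Y, a, sigma) for y in Y])
    d = np.linalg.norm(ends[:, None, :] - Y[None, :, :], axis=2)
    return d.min(axis=1).max()

gap_sparse = endpoint_gap(ring(4))   # well-separated samples: endpoints stay on the samples
gap_dense = endpoint_gap(ring(64))   # closely spaced samples: endpoints move off the samples
```

With four well-separated samples the endpoints coincide with the samples to within a small tolerance. With 64 closely spaced samples the endpoints settle on a slightly smaller ring, so the low-energy set follows the shape of the data rather than the individual points.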

Spurious states

Spurious states are the transition object. In associative memory, they are stable endpoints of the retrieval dynamics that are not stored patterns. In the diffusion setting, they appear as generated samples that are absent from the training set but stable enough to recur under sampling.

This gives an operational distinction. A memorized sample is close to a training example. A spurious sample is far from the training set but has close neighbors among other generated samples. A generalized sample is neither a training-set copy nor a recurring generated attractor.
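The distinction can be written as a nearest-neighbor rule. The sketch below is one possible reading, not the paper's exact procedure: the Euclidean metric, the thresholds `delta_train` and `delta_gen`, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def nearest_distance(A, B, exclude_self=False):
    """Distance from each row of A to its nearest row of B (Euclidean, assumed metric)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if exclude_self:
        # Only meaningful when A and B are the same set: ignore zero self-distances.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def classify(generated, train, delta_train, delta_gen):
    """Label each generated sample as memorized, spurious, or generalized."""
    d_train = nearest_distance(generated, train)
    d_gen = nearest_distance(generated, generated, exclude_self=True)
    return np.where(
        d_train <= delta_train, "memorized",                       # close to a training example
        np.where(d_gen <= delta_gen, "spurious", "generalized"),   # far from training data
    )

# Toy example with hypothetical thresholds.
train = np.array([[0.0, 0.0]])
generated = np.array([[0.01, 0.0], [5.0, 5.0], [5.01, 5.0], [-7.0, 3.0]])
labels = classify(generated, train, delta_train=0.1, delta_gen=0.1)
# labels: memorized, spurious, spurious, generalized
```

The training-set test is applied first, so a sample that is close to a training example is labeled memorized even if other generated samples are also nearby.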

Diagnostic. The onset of generalization is detected not only by fewer training-set copies, but also by the appearance of generated attractors away from the training data.

This is the statistical-physics content of the result. Spurious states are neither sampling noise nor arbitrary artifacts. They are the expected signature of the intermediate regime of a memory system whose stored patterns begin to interfere.

Low-dimensional geometry

In two dimensions, the attractors can be drawn directly. The red stars are training samples, the yellow crosses are attractor states, and the vector field shows the direction of retrieval. Darker regions correspond to lower energy.
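The quantities in such a plot can be computed directly from the empirical energy. The sketch below evaluates \(E_\sigma\) (without \(C_\sigma\)) and the retrieval direction \(-\nabla E_\sigma(x) = \left(a_\sigma \sum_i w_i(x)\, y_i - x\right)/\sigma^2\) on a grid, where \(w_i\) are the softmax weights of the log-sum-exp term. The function name, grid, training points, and parameter values are illustrative assumptions; plotting is left out.

```python
import numpy as np

def energy_and_field(points, Y, a, sigma):
    """Return E_sigma (without C_sigma) and -grad E_sigma at each row of points."""
    bias = -(a**2 / (2 * sigma**2)) * np.sum(Y**2, axis=1)
    logits = (a / sigma**2) * (points @ Y.T) + bias        # shape (P, N)
    m = logits.max(axis=1, keepdims=True)
    z = np.exp(logits - m)
    lse = m[:, 0] + np.log(z.sum(axis=1))                  # stable log-sum-exp per point
    w = z / z.sum(axis=1, keepdims=True)                   # softmax weights over samples
    energy = np.sum(points**2, axis=1) / (2 * sigma**2) - lse
    field = (a * (w @ Y) - points) / sigma**2              # direction of retrieval
    return energy, field

# Toy training set and evaluation grid (assumed values).
Y = np.array([[1.0, 0.0], [-0.5, 0.8], [0.2, -1.0]])
xs = np.linspace(-2.0, 2.0, 41)
gx, gy = np.meshgrid(xs, xs)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)

energy, field = energy_and_field(grid, Y, a=1.0, sigma=0.5)
energy_map = energy.reshape(gx.shape)        # values for a shaded energy plot
field_u = field[:, 0].reshape(gx.shape)      # x-component of the vector field
field_v = field[:, 1].reshape(gx.shape)      # y-component of the vector field
```

Attractors are the grid regions where the field vanishes and the energy has a local minimum. The arrays above are in the layout that standard contour and quiver plotting routines expect.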

Two-dimensional memorization, spurious, and generalization regimes
Figure 2. With sparse support, attractors coincide with stored examples. Near the transition, spurious attractors appear away from the training data. With dense support, the attractors follow a continuous low-energy structure.
Reference exact energy landscape
Reference landscape. The target distribution has continuous geometric structure. Generalization requires recovering this structure rather than only retrieving observed samples.
Energy transition from memorization to generalization
Energy transition. Increasing empirical support changes the landscape from isolated wells, through spurious minima, toward a connected low-energy region.

Image-space evidence

In image space, the attractors cannot be drawn directly. They must be inferred from neighborhoods: each generated sample is compared to the training set and to the other generated samples. This separates sample-level retrieval, spurious retrieval, and distribution-level generation.

CIFAR-10 examples of memorization
Figure 3. Memorized samples are close to stored training examples. In the associative-memory view, they correspond to retrieval from isolated basins.
Memorized, spurious, and generalized image examples
Figure 4. Generated samples are classified by their relation to training data and to other generated samples: stored memories, spurious attractors, and generalized samples.

Statistical-physics interpretation

The associative-memory formulation identifies the transition as a change in attractor organization under increasing memory load. At low load, stored examples are isolated minima and generation behaves as memory retrieval. At the transition, memory basins interfere and spurious minima appear. Beyond pure retrieval, the low-energy set becomes increasingly governed by the geometry of the data distribution.

Low memory load

Stored examples remain isolated attractors. Sampling returns training examples.

Transition regime

Basins interfere. Spurious attractors appear away from the training set.

Generalizing regime

Generation is no longer dominated by sample-level retrieval.

Our paper does not treat spurious states as secondary artifacts. They are the transition signature. In classical associative memory, spurious states mark the failure of exact recall. In diffusion models, the same phenomenon marks the point where exact recall stops dominating and generative structure begins to emerge.

Generalization is therefore not simply the absence of memorization. It is a reorganization of the energy landscape: from isolated sample minima, through spurious attractors, toward a low-energy structure aligned with the data geometry.

Citation

@article{pham2025memorization,
  title={Memorization to Generalization: Emergence of Diffusion Models from Associative Memory},
  author={Pham, Bao and Raya, Gabriel and Negri, Matteo and Zaki, Mohammed J. and Ambrogioni, Luca and Krotov, Dmitry},
  journal={arXiv preprint arXiv:2505.21777},
  year={2025},
  url={https://arxiv.org/abs/2505.21777}
}