Diffusion models have revolutionized generative modeling, particularly in producing high-quality images. Yet, when faced with complex prompts involving multiple objects or conditions, they often stumble: generated images might satisfy only part of a prompt or contain inconsistencies.
In our ICML 2025 paper, we introduce CompLift—a lightweight, training-free rejection criterion using lift scores—to address this issue.
Table of Contents:
- Sometimes, Compositional Prompts Break Diffusion Models
- What is CompLift?
- Results
- Lift in the Latent Space
- Correlation between Lift Scores and CLIP Scores
- Get Started
🤔 Sometimes, Compositional Prompts Break Diffusion Models
Compositional prompts require satisfying multiple conditions simultaneously. Diffusion models such as Stable Diffusion, despite being trained on rich datasets, tend to focus on the objects mentioned earlier in the prompt. As a result, given a prompt like:
“a black car and a white clock”
The model might generate a black car and ignore the white clock. See the example below, showing 100 samples generated by Stable Diffusion XL: the 62 blue-bordered samples satisfy the constraint of including a white clock, while the 38 orange-bordered samples do not.
Some previous works, e.g., Attend & Excite, hypothesize that this happens because the model is biased towards some parts of the prompt, while ignoring others.
Our goal is to design a rejection criterion that can detect whether a generated sample satisfies each component of a compositional prompt, and reject it if it does not.
🚀 What is CompLift?
CompLift is a plug-and-play resampling technique that checks if a generated sample truly satisfies a prompt—without retraining or using any extra models. It leverages a classic statistical idea: lift scores.
🔍 Lift Scores, Explained
Lift measures how much a condition \(c\) influences the probability of generating a sample \(x\). Formally:
\[\begin{equation} \text{lift}(x|c) = \log\frac{p(x|c)}{p(x)} \tag{1} \label{eqn:lift} \end{equation}\]

In practice, we show that we can approximate Eq. \eqref{eqn:lift} using the internal denoising predictions of the diffusion model itself, by comparing the unconditional and conditional denoising errors (up to constants and per-step weights):

\[\begin{equation} \text{lift}(x_0|c) \approx \mathbb{E}_{t,\epsilon}\left[\lVert\epsilon - \epsilon_\theta(x_t)\rVert^2 - \lVert\epsilon - \epsilon_\theta(x_t, c)\rVert^2\right] \tag{2} \label{eqn:lift-approx} \end{equation}\]

Similar techniques can be found in the Diffusion Classifier.
Note:
- \(\epsilon_\theta\) is the diffusion model's noise predictor
- \(t\) is a randomly sampled diffusion step
- \(\epsilon \sim \mathcal{N}(0, I)\) is randomly sampled Gaussian noise
- \(x_t\) is the sample at step \(t\), i.e., a noisy version of \(x_0\)
Intuitively, Eq. \eqref{eqn:lift-approx} means:
If a sample aligns with the condition, the model should be better at denoising it when given that condition.
This allows us to check conditions after generation, and reject samples that don’t satisfy them. Note that this is training-free and does not require any extra models.
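To make this concrete, below is a minimal PyTorch sketch of the Monte Carlo estimator in Eq. \eqref{eqn:lift-approx}. It is a simplified illustration rather than our exact implementation: `eps_model` stands in for the diffusion model's noise predictor (with `cond=None` for the unconditional branch), `alphas_cumprod` for its noise schedule, and the toy predictor in the demo exists purely to make the snippet runnable.

```python
import torch

def lift_score(eps_model, x0, cond, alphas_cumprod, n_samples=32):
    """Monte Carlo estimate of lift(x0 | cond) via Eq. (2).

    Positive score => the model denoises x0 better when given `cond`.
    """
    total = 0.0
    T = alphas_cumprod.shape[0]
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,))                  # random diffusion step
        eps = torch.randn_like(x0)                     # random Gaussian noise
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # noisy version of x0
        err_uncond = (eps - eps_model(x_t, t, None)).pow(2).sum()
        err_cond = (eps - eps_model(x_t, t, cond)).pow(2).sum()
        total += (err_uncond - err_cond).item()
    return total / n_samples

# Toy demo with a stand-in predictor (replace with a real diffusion model).
if __name__ == "__main__":
    torch.manual_seed(0)
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
    x0 = torch.ones(1, 4, 8, 8)      # a "sample" that matches the condition
    cond = torch.ones(1, 4, 8, 8)    # toy condition: the clean sample itself

    def toy_eps_model(x_t, t, c):
        # Invert the forward noising, guessing x0 = c if conditioned, else 0.
        guess = c if c is not None else torch.zeros_like(x_t)
        return (x_t - alphas_cumprod[t].sqrt() * guess) / (1 - alphas_cumprod[t]).sqrt()

    print(lift_score(toy_eps_model, x0, cond, alphas_cumprod))  # > 0: accept
```

For a single condition, a sample is then accepted iff its estimated lift is positive.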
🧠 Compositional Logic via Lift
CompLift handles logical combinations of conditions:
- AND (Product): Accept only if all subconditions have positive lift.
- OR (Mixture): Accept if any subcondition has positive lift.
- NOT (Negation): Accept only if the negated subcondition has non-positive lift.
| Type | Algebra | Acceptance Criterion |
| --- | --- | --- |
| Product | \(c_1 \wedge c_2\) | \(\min_{i\in[1,2]}\text{lift}(x \vert c_i) > 0\) |
| Mixture | \(c_1 \vee c_2\) | \(\max_{i\in[1,2]}\text{lift}(x \vert c_i) > 0\) |
| Negation | \(\neg c_1\) | \(\text{lift}(x \vert c_1) \leq 0\) |

Table 1: Examples of composition rules for multiple conditions.
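As a minimal illustration, the rules in Table 1 reduce to a few comparisons once the per-condition lift scores are computed (e.g., with the `lift_score` sketch above); the function names here are ours, not from the released code:

```python
def accept_product(lifts):    # c1 AND c2 AND ...: every lift must be positive
    return min(lifts) > 0

def accept_mixture(lifts):    # c1 OR c2 OR ...: some lift must be positive
    return max(lifts) > 0

def accept_negation(lift_c):  # NOT c: the condition must have non-positive lift
    return lift_c <= 0

print(accept_product([0.8, 1.2]))   # True: both subconditions satisfied
print(accept_product([0.8, -0.3]))  # False: the second subcondition fails
```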
This compositional approach lets us flexibly combine prompts into logical structures! We call these compositional criteria CompLift. Below we show some examples on 2D synthetic tasks, using Composable Diffusion as the baseline; "+ CompLift" means we use CompLift to reject the Composable Diffusion samples that do not satisfy the criteria.
Product Example

*(Figures: Composable Diffusion vs. + CompLift samples for the product composition.)*

Mixture Example

*(Figures: Composable Diffusion vs. + CompLift samples for the mixture composition.)*

Negation Example

*(Figures: Composable Diffusion vs. + CompLift samples for the negation composition.)*
⚙️ Efficient Implementation
In our paper, we investigate several design choices for CompLift. A TL;DR version (with a code sketch after the list):
- Sharing sampled \((\epsilon, t)\) pairs between the estimation of \(p(x)\) and \(p(x|c)\) is most efficient.
- Importance sampling on \(t\) is required when the original diffusion model \(\epsilon_\theta\) is trained with importance sampling.
- We can reuse the \(\epsilon_\theta(x_t, c)\) values already computed when generating \(x\) to save compute when estimating \(\text{lift}\).
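A hedged sketch of the first and third bullets, reusing the toy interface from `lift_score` above. `cached_eps_cond` is a hypothetical dictionary of conditional predictions saved during the sampling loop; reusing a cached prediction is only exact when its \(x_t\) matches the one being evaluated here.

```python
import torch

def lift_score_shared(eps_model, x0, cond, alphas_cumprod,
                      pairs, cached_eps_cond=None):
    """Lift estimate where p(x) and p(x|c) share the same (eps, t) pairs."""
    total = 0.0
    for t, eps in pairs:                              # shared (eps, t) pairs
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        step = int(t)
        if cached_eps_cond is not None and step in cached_eps_cond:
            eps_c = cached_eps_cond[step]             # reuse work from generation
        else:
            eps_c = eps_model(x_t, t, cond)
        eps_u = eps_model(x_t, t, None)
        total += ((eps - eps_u).pow(2).sum() - (eps - eps_c).pow(2).sum()).item()
    return total / len(pairs)
```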
📝 Connection to Classifier-Free Guidance
One might notice that the above implementation is similar to the classifier-free guidance (CFG) technique: we have \(\epsilon_\theta(x_t, c)\) and compare it with \(\epsilon_\theta(x_t, \varnothing)\). In the appendix of our paper, we show that the two are dual to each other from the Lagrangian perspective (the combined prediction is written out below for concreteness).
CompLift solves the primal problem - the sampler tries to generate samples that satisfy the ground truth condition.
CFG / Composable Diffusion solves the dual problem - the sampler tries to generate samples that are close to the ground truth condition and keeps a balance using the Lagrangian coefficient \(\lambda_i\).
Note:
- CFG / Composable Diffusion often sets the Lagrangian coefficient \(\lambda_i\) as a fixed weight value \(w\).
- The choice of \(p_{\text{generator}}(x_0)\) can vary. CompLift mainly uses \(p_{\text{generator}}(x_0) = p_\theta(x_0 \mid c_{\text{prompt}})\), while CFG / Composable Diffusion often regards \(p_{\text{generator}}(x_0)\) as \(p_\theta(x_0 \mid \varnothing)\).
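For concreteness, the standard CFG / Composable Diffusion combination is built from the same two quantities at sampling time, with \(w_i\) playing the role of the fixed coefficient above:

\[\tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t, \varnothing) + \sum_i w_i \left(\epsilon_\theta(x_t, c_i) - \epsilon_\theta(x_t, \varnothing)\right)\]

CompLift leaves the sampler untouched and instead uses the gap between \(\epsilon_\theta(x_t, c_i)\) and \(\epsilon_\theta(x_t, \varnothing)\) as a post-hoc acceptance test.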
🧪 Results
We evaluate CompLift on:
- 🎨 2D synthetic tasks
- 🧱 CLEVR object positioning
- 🖼 Text-to-image generation (SD 1.4/2.1/XL)
Across the board, CompLift significantly improves prompt alignment without harming quality. More details can be found in our paper.
| Task | Metric | Baseline | +CompLift |
| --- | --- | --- | --- |
| 2D Synthetic Task | Accuracy | 45.0% | 92.9% |
| CLEVR Position (5 objects) | Accuracy | 78.7% | 90.3% |
| SD 2.1 | CLIP Score | 0.342 | 0.352 |
Here we show some examples of samples generated with Stable Diffusion XL. Blue-bordered images are accepted by CompLift, while orange-bordered images are rejected.
*(Figures: Stable Diffusion XL samples for several compositional prompts; blue borders mark accepted samples, orange borders mark rejected ones.)*
In the examples above, CompLift shows great promise in improving prompt alignment. However, some limitations of CompLift are also apparent:
- Since the ELBO estimation is an approximation, the lift score is not always accurate. This leads to some samples being rejected that should be accepted, and vice versa (e.g., the 1st image in "a frog with a bow" and the 2nd image in "a lion with a bow").
- Very small objects tend to be rejected, e.g., the 2nd image in "a red backpack and a yellow bowl".
- Color-based conditions are not always identified effectively, e.g., the 5th image in "an orange backpack and a purple car".
🔬 Lift in the Latent Space
To handle fine-grained image prompts, we compute lift scores per pixel in the latent space, allowing us to detect whether each object is truly present. This also helps us understand which part of the image aligns with each prompt component; a sketch of the per-pixel computation follows.
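A minimal sketch of the per-pixel variant, under the same toy interface as before: instead of summing the squared denoising errors over all dimensions, we keep the spatial axes and threshold at zero. The interface names are illustrative, not from the released code.

```python
import torch

def pixel_lift_map(eps_model, x0, cond, alphas_cumprod, n_samples=32):
    """Per-pixel lift in latent space: positive entries mark 'activated' pixels."""
    lift_map = torch.zeros(x0.shape[-2:])             # one score per latent pixel
    T = alphas_cumprod.shape[0]
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(x0)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        err_u = (eps - eps_model(x_t, t, None)).pow(2)
        err_c = (eps - eps_model(x_t, t, cond)).pow(2)
        lift_map += (err_u - err_c).mean(dim=(0, 1))  # average batch and channels
    lift_map /= n_samples
    return lift_map                                   # object mask: lift_map > 0
```

Thresholding the map at zero gives the masks visualized below: pixels with positive lift are kept, the rest are shown in black.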
Here are some examples showing how lift scores help identify different objects in complex scenes:
*(Figures: per-pixel lift visualizations for different objects in complex scenes.)*
The brighter regions in each visualization indicate where the lift scores are higher for that particular object or concept, showing how CompLift can identify the spatial location of each component in the prompt. Pixels with negative lift scores are masked and shown in black.
📊 Correlation between Lift Scores and CLIP Scores
We investigate the empirical correlation between lift scores and the CLIP scores of generated images. Here we evaluate the CLIP score and the number of activated pixels (pixels with positive lift scores) for each image generated with the prompt "a black car and a white clock".
*(Interactive figure: CLIP score vs. number of activated pixels for each generated sample.)*
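The analysis itself is straightforward to reproduce; here is a hypothetical sketch with placeholder values (the real inputs are the per-image CLIP scores and activated-pixel counts from the experiment above):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder values; in practice, one entry per generated image.
clip_scores = np.array([0.28, 0.31, 0.35, 0.36])   # CLIP score per image
activated = np.array([45, 120, 310, 340])          # pixels with positive lift

r, p = pearsonr(activated, clip_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```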
🧰 Get Started
Want to try it? Check out our GitHub repo!
```bash
git clone https://github.com/rainorangelemon/complift-t2i
cd complift-t2i
python run_lift.py --prompt "a black car and a white clock"
```
🧩 Final Thoughts
CompLift shows how diffusion models can be made compositional—by introspecting their own denoising behavior. It’s simple, elegant, and practical.
While we’ve focused on image generation, the principles behind lift scores may apply to video, music, or even language generation in the future.
📄 Paper
Chenning Yu and Sicun Gao. Improving Compositional Generation with Diffusion Models Using Lift Scores. ICML 2025. [ArXiv] [GitHub]
```bibtex
@inproceedings{yu2025improving,
  title     = {Improving Compositional Generation with Diffusion Models Using Lift Scores},
  author    = {Yu, Chenning and Gao, Sicun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```