
Diffusion models have revolutionized generative modeling, particularly in producing high-quality images. Yet, when faced with complex prompts involving multiple objects or conditions, they often stumble: generated images might satisfy only part of a prompt or contain inconsistencies.

In our ICML 2025 paper, we introduce CompLift—a lightweight, training-free rejection criterion using lift scores—to address this issue.

Table of Contents:

  1. Sometimes, Compositional Prompts Break Diffusion Models
  2. What is CompLift?
    1. Lift Scores, Explained
    2. Compositional Logic via Lift
    3. Efficient Implementation
    4. Connection to Classifier-Free Guidance
  3. Results
  4. Lift in the Latent Space
  5. Correlation between Lift Scores and CLIP Scores
  6. Get Started

🤔 Sometimes, Compositional Prompts Break Diffusion Models

Compositional prompts require satisfying multiple conditions simultaneously. Diffusion models such as Stable Diffusion, even when trained on rich datasets, tend to focus on the objects mentioned earlier in the prompt. As a result, given a prompt like:

“a black car and a white clock”

The model tends to generate a black car and ignore the white clock. See the example below, with 100 samples generated by Stable Diffusion XL: the 62 blue-bordered samples satisfy the constraint of including a white clock, while the 38 orange-bordered samples do not.

Some previous works, e.g., Attend & Excite, hypothesize that this happens because the model is biased towards some parts of the prompt, while ignoring others.

Our goal is to design a rejection criterion that can detect whether a generated sample satisfies each component of a compositional prompt, and reject it if it does not.

🚀 What is CompLift?

CompLift is a plug-and-play resampling technique that checks if a generated sample truly satisfies a prompt—without retraining or using any extra models. It leverages a classic statistical idea: lift scores.

🔍 Lift Scores, Explained

Lift measures how much a condition \(c\) influences the probability of generating a sample \(x\). Formally:

\[\begin{equation} \text{lift}(x|c) = \log\frac{p(x|c)}{p(x)} \tag{1} \label{eqn:lift} \end{equation}\]

In practice, we show that we can approximate Eq. \eqref{eqn:lift} using the internal denoising predictions of the diffusion model itself. Similar techniques can be found in Diffusion Classifier.

$$ \begin{equation} \text{lift}(x|c) \approx \mathbb{E}_{t,\epsilon}\{||\epsilon-\epsilon_\theta(x_t, \varnothing)||^2-||\epsilon-\epsilon_\theta(x_t, c)||^2\} \tag{2} \label{eqn:lift-approx} \end{equation} $$


Intuitively, Eq. \eqref{eqn:lift-approx} means:

If a sample aligns with the condition, the model should be better at denoising it when given that condition.

This allows us to check conditions after generation, and reject samples that don’t satisfy them. Note that this is training-free and does not require any extra models.
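To make this concrete, here is a minimal sketch of the Monte-Carlo estimator in Eq. \eqref{eqn:lift-approx}, assuming a diffusers-style epsilon-prediction UNet and noise scheduler; the function and variable names are illustrative, not the exact CompLift implementation.

```python
# Minimal sketch of Eq. (2), assuming a diffusers-style epsilon-prediction
# UNet and noise scheduler. Names and defaults are illustrative assumptions,
# not the released CompLift code.
import torch

@torch.no_grad()
def estimate_lift(unet, scheduler, x0, cond_emb, null_emb, n_samples=32):
    """Monte-Carlo estimate of lift(x | c) for a clean (latent) sample x0."""
    total = 0.0
    for _ in range(n_samples):
        # Draw a timestep t and a noise epsilon, shared by both terms of Eq. (2).
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=x0.device)
        eps = torch.randn_like(x0)
        x_t = scheduler.add_noise(x0, eps, t)

        # Denoising predictions without and with the condition c.
        eps_uncond = unet(x_t, t, encoder_hidden_states=null_emb).sample
        eps_cond = unet(x_t, t, encoder_hidden_states=cond_emb).sample

        # ||eps - eps_theta(x_t, ∅)||^2 - ||eps - eps_theta(x_t, c)||^2
        total += ((eps - eps_uncond) ** 2).sum() - ((eps - eps_cond) ** 2).sum()
    return total / n_samples
```

A sample is then accepted for condition \(c\) whenever the estimated lift is positive.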

🧠 Compositional Logic via Lift

CompLift handles logical combinations of conditions:

| Type | Algebra | Acceptance Criterion |
|---|---|---|
| Product | \(c_1 \wedge c_2\) | \(\min_{i\in[1,2]}\text{lift}(x \mid c_i) > 0\) |
| Mixture | \(c_1 \vee c_2\) | \(\max_{i\in[1,2]}\text{lift}(x \mid c_i) > 0\) |
| Negation | \(\neg c_1\) | \(\text{lift}(x \mid c_1) \leq 0\) |

Table 1: Examples of composition rules for multiple conditions.
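Once the per-condition lift scores are available, the acceptance rules in Table 1 reduce to a few lines of code. The sketch below is our own illustration, with hypothetical function names.

```python
# Acceptance rules from Table 1, given the estimated lift score of each
# primitive condition. Function names are illustrative.
def accept_product(lifts):     # c1 ∧ c2 ∧ ...: every condition must have positive lift
    return min(lifts) > 0

def accept_mixture(lifts):     # c1 ∨ c2 ∨ ...: at least one condition has positive lift
    return max(lifts) > 0

def accept_negation(lift_c1):  # ¬c1: the condition should not increase the likelihood of x
    return lift_c1 <= 0
```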

This compositional approach lets us flexibly combine prompts into logical structures! We call this compositional criterion CompLift. Below we show some examples on 2D synthetic tasks, using Composable Diffusion as the baseline. "+ CompLift" means we use CompLift to reject the Composable Diffusion samples that do not satisfy the criteria.

[Figure: Product example. Component A \(\times\) Component B: ground truth, baseline, and with CompLift.]
[Figure: Mixture example. Component A \(+\) Component B: ground truth, baseline, and with CompLift.]
[Figure: Negation example. Component A \(-\) Component B: ground truth, baseline, and with CompLift.]

⚙️ Efficient Implementation

In our paper, we investigate several design choices for CompLift. A TL;DR version is:

  1. Sharing sampled \((\epsilon, t)\) pairs between the estimation of \(p(x)\) and \(p(x \mid c)\) is most efficient (see the sketch after this list).
  2. Importance sampling over \(t\) is required when the original diffusion model \(\epsilon_\theta\) was trained with importance sampling.
  3. We can reuse the \(\epsilon_\theta(x_t, c)\) predictions that were already computed when \(x\) was generated, to save compute when estimating \(\text{lift}\).
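Here is a hedged sketch of design choice 1, where a single \((\epsilon, t)\) draw is shared by the unconditional pass and every condition \(c_i\), batched into one UNet call; the variable names are assumptions, not the released code.

```python
# Sharing one (epsilon, t) draw across the unconditional term and all
# conditions c_i, batched into a single UNet forward pass.
import torch

@torch.no_grad()
def lift_scores_shared(unet, scheduler, x0, cond_embs, null_emb, n_samples=32):
    """Return one Monte-Carlo lift estimate per condition in cond_embs."""
    k = len(cond_embs)
    lifts = torch.zeros(k, device=x0.device)
    for _ in range(n_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=x0.device)
        eps = torch.randn_like(x0)
        x_t = scheduler.add_noise(x0, eps, t)

        # Batch [∅, c_1, ..., c_k]: all predictions share the same x_t, t, and eps.
        embs = torch.cat([null_emb] + list(cond_embs), dim=0)
        preds = unet(x_t.repeat(k + 1, 1, 1, 1), t.repeat(k + 1),
                     encoder_hidden_states=embs).sample

        errs = ((eps - preds) ** 2).flatten(1).sum(dim=1)  # squared error per prediction
        lifts += errs[0] - errs[1:]                        # unconditional minus conditional
    return lifts / n_samples
```

Design choice 3 goes further: the \(\epsilon_\theta(x_t, c_i)\) predictions that the sampler already computes during generation can be cached and reused, avoiding extra forward passes.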

📝 Connection to Classifier-Free Guidance

One may notice that the implementation above is similar to classifier-free guidance (CFG): we have \(\epsilon_\theta(x_t, c)\) and compare it with \(\epsilon_\theta(x_t, \varnothing)\). In the appendix of our paper, we show that the two are dual to each other from the Lagrangian perspective.

CompLift solves the primal problem: the sampler tries to generate samples that satisfy the ground-truth conditions.

$$ \begin{aligned} & x_0 \sim p_{\text{generator}}(x_0), \quad \text{s.t. } p(x_0 \mid c_i) > p(x_0), \quad \forall c_i, \\ \Leftrightarrow \quad & x_0 \sim p_{\text{generator}}(x_0), \quad \text{s.t. } \text{lift}(x_0 \mid c_i) > 0, \quad \forall c_i. \end{aligned} $$

CFG / Composable Diffusion solves the dual problem: the sampler tries to generate samples that are close to the ground-truth conditions, balancing them with the Lagrangian coefficients \(\lambda_i\).

$$ \begin{aligned} & \mathcal{L}(x_0, \lambda) = \log p_{\text{generator}}(x_0) + \sum_{c_i} \lambda_i \Bigl( \log p(x_0 \mid c_i) - \log p(x_0) \Bigr), \quad \lambda_i \geq 0, \\ \Rightarrow & \nabla_{x_t} \mathcal{L}(x_0, \lambda) \approx \epsilon_\theta(x_t, \varnothing) + \sum_{c_i} \lambda_i \Bigl( \epsilon_\theta(x_t, c_i) - \epsilon_\theta(x_t, \varnothing) \Bigr), \quad \lambda_i \geq 0. \end{aligned} $$
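For reference, the dual update above is the familiar CFG-style combination of noise predictions. A minimal sketch, with \(\lambda_i\) passed as `lams` (our own naming):

```python
# Composable-Diffusion / CFG-style combination of noise predictions:
# eps(x_t, ∅) + sum_i lam_i * (eps(x_t, c_i) - eps(x_t, ∅)).
def composed_epsilon(eps_uncond, eps_conds, lams):
    out = eps_uncond.clone()
    for eps_c, lam in zip(eps_conds, lams):
        out = out + lam * (eps_c - eps_uncond)
    return out
```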


🧪 Results

We evaluate CompLift on 2D synthetic composition tasks, CLEVR positional composition, and text-to-image generation with Stable Diffusion 2.1 and SDXL.

Across the board, CompLift significantly improves prompt alignment without harming quality. More details can be found in our paper.

| Task | Metric | Baseline | +CompLift |
|---|---|---|---|
| 2D Synthetic Task | Accuracy | 45.0% | 92.9% |
| CLEVR Position (5 objects) | Accuracy | 78.7% | 90.3% |
| SD 2.1 | CLIP Score | 0.342 | 0.352 |

Here we show some examples of samples generated with Stable Diffusion XL. Blue-bordered images are accepted by CompLift, while orange-bordered images are rejected.

In the examples above, CompLift shows great promise in improving prompt alignment, though the examples also hint at some of its limitations.

🔬 Lift in the Latent Space

To handle fine-grained image prompts, we compute lift scores per pixel in the latent space, allowing us to detect whether each object is truly present. This even helps in understanding which part of the image aligns with each prompt component.

Here are some examples showing how lift scores help identify different objects in complex scenes:

The brighter regions in each visualization indicate where the lift scores are higher for that particular object or concept, showing how CompLift can identify the spatial location of each component in the prompt. Pixels with negative lift scores are masked and shown in black.
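A minimal sketch of how such a per-pixel lift map could be computed: keep the squared denoising errors from Eq. \eqref{eqn:lift-approx} unreduced, then sum only over the channel dimension and average over the sampled \((\epsilon, t)\) pairs. Shapes and names here are assumptions for illustration.

```python
import torch

def pixelwise_lift(err_uncond, err_cond):
    """err_* hold ||eps - eps_theta(x_t, ·)||^2 per latent element, with shape
    (n_samples, C, H, W). Reduce over samples and channels to get an (H, W) map."""
    lift_map = (err_uncond - err_cond).sum(dim=1).mean(dim=0)
    mask = lift_map > 0   # "activated" pixels; negative-lift pixels are shown in black
    return lift_map, mask
```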

📊 Correlation between Lift Scores and CLIP Scores

We investigate the empirical correlation between the lift scores and the CLIP scores of the generated images. Here we evaluate the CLIP scores and the number of activated pixels (pixels with positive lift scores) for each generated image with the prompt “a black car and a white clock”.
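For reference, the CLIP score of an image-prompt pair can be computed as the cosine similarity between CLIP image and text embeddings. The rough sketch below uses the Hugging Face CLIP model; the exact model and evaluation setup in the paper may differ.

```python
# Cosine similarity between CLIP image and text embeddings for one image/prompt pair.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()
```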


🧰 Get Started

Want to try it? Check out our GitHub repo!

git clone https://github.com/rainorangelemon/complift-t2i
cd complift-t2i
python run_lift.py --prompt "a black car and a white clock"

🧩 Final Thoughts

CompLift shows how diffusion models can be made compositional—by introspecting their own denoising behavior. It’s simple, elegant, and practical.

While we’ve focused on image generation, the principles behind lift scores may apply to video, music, or even language generation in the future.

📄 Paper

Chenning Yu and Sicun Gao. Improving Compositional Generation with Diffusion Models Using Lift Scores. ICML 2025. [ArXiv] [GitHub]

@inproceedings{yu2025improving,
 title={Improving Compositional Generation with Diffusion Models Using Lift Scores},
 author={Yu, Chenning and Gao, Sicun},
 booktitle={Proceedings of the 42nd International Conference on Machine Learning (ICML)},
 year={2025}
}