Diffusion models have revolutionized generative modeling, particularly in producing high-quality images. Yet, when faced with complex prompts involving multiple objects or conditions, they often stumble: generated images might satisfy only part of a prompt or contain inconsistencies.
In our ICML 2025 paper, we introduce CompLift—a lightweight, training-free rejection criterion using lift scores—to address this issue.
Table of Contents:
- Sometimes, Compositional Prompts Break Diffusion Models
- What is CompLift?
- Results
- Lift in the Latent Space
- Correlation between Lift Scores and CLIP Scores
- Get Started
🤔 Sometimes, Compositional Prompts Break Diffusion Models
Compositional prompts require satisfying multiple conditions simultaneously. Diffusion models such as Stable Diffusion, despite being trained on rich datasets, tend to focus on the objects mentioned earlier in the prompt. As a result, given a prompt like:
“a black car and a white clock”
The model might generate a black car and ignore the white clock. See the example below, showing 100 samples generated by Stable Diffusion XL: the 62 blue-bordered samples satisfy the constraint of including a white clock, while the 38 orange-bordered samples do not.
Some previous works, e.g., Attend & Excite, hypothesize that this happens because the model is biased towards some parts of the prompt, while ignoring others.
Our goal is to design a rejection criterion that can detect whether a generated sample satisfies each component of a compositional prompt, and reject it if it does not.
🚀 What is CompLift?
CompLift is a plug-and-play resampling technique that checks if a generated sample truly satisfies a prompt—without retraining or using any extra models. It leverages a classic statistical idea: lift scores.
🔍 Lift Scores, Explained
Lift measures how much a condition \(c\) influences the probability of generating a sample \(x\). Formally:
\[\begin{equation} \text{lift}(x|c) = \log\frac{p(x|c)}{p(x)} \tag{1} \label{eqn:lift} \end{equation}\]

In practice, we show that we can approximate Eq. \eqref{eqn:lift} using the internal denoising predictions of the diffusion model itself, by comparing the unconditional and conditional denoising errors (up to constants and per-step weights):

\[\begin{equation} \text{lift}(x_0|c) \approx \mathbb{E}_{t,\epsilon}\left[\lVert\epsilon - \epsilon_\theta(x_t)\rVert^2 - \lVert\epsilon - \epsilon_\theta(x_t, c)\rVert^2\right] \tag{2} \label{eqn:lift-approx} \end{equation}\]

Similar techniques can be found in the Diffusion Classifier.
Note:
- \(\epsilon_\theta\) is the diffusion model's noise predictor
- \(t\) is a randomly sampled diffusion step
- \(\epsilon \sim \mathcal{N}(0, I)\) is randomly sampled Gaussian noise
- \(x_t\) is the sample at step \(t\), i.e., a noisy version of \(x_0\)
Intuitively, Eq. \eqref{eqn:lift-approx} means:
If a sample aligns with the condition, the model should be better at denoising it when given that condition.
This allows us to check conditions after generation, and reject samples that don’t satisfy them. Note that this is training-free and does not require any extra models.
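To make this concrete, below is a minimal PyTorch sketch of the Monte Carlo estimator in Eq. \eqref{eqn:lift-approx}. It is a simplified illustration rather than our exact implementation: `eps_model` stands in for the diffusion model's noise predictor (with `cond=None` for the unconditional branch), `alphas_cumprod` for its noise schedule, and the toy predictor in the demo exists purely to make the snippet runnable.

```python
import torch

def lift_score(eps_model, x0, cond, alphas_cumprod, n_samples=32):
    """Monte Carlo estimate of lift(x0 | cond) via Eq. (2).

    Positive score => the model denoises x0 better when given `cond`.
    """
    total = 0.0
    T = alphas_cumprod.shape[0]
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,))                  # random diffusion step
        eps = torch.randn_like(x0)                     # random Gaussian noise
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # noisy version of x0
        err_uncond = (eps - eps_model(x_t, t, None)).pow(2).sum()
        err_cond = (eps - eps_model(x_t, t, cond)).pow(2).sum()
        total += (err_uncond - err_cond).item()
    return total / n_samples

# Toy demo with a stand-in predictor (replace with a real diffusion model).
if __name__ == "__main__":
    torch.manual_seed(0)
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
    x0 = torch.ones(1, 4, 8, 8)      # a "sample" that matches the condition
    cond = torch.ones(1, 4, 8, 8)    # toy condition: the clean sample itself

    def toy_eps_model(x_t, t, c):
        # Invert the forward noising, guessing x0 = c if conditioned, else 0.
        guess = c if c is not None else torch.zeros_like(x_t)
        return (x_t - alphas_cumprod[t].sqrt() * guess) / (1 - alphas_cumprod[t]).sqrt()

    print(lift_score(toy_eps_model, x0, cond, alphas_cumprod))  # > 0: accept
```

For a single condition, a sample is then accepted iff its estimated lift is positive.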
🧠 Compositional Logic via Lift
CompLift handles logical combinations of conditions:
- AND (Product): Accept only if all subconditions have positive lift.
- OR (Mixture): Accept if any subcondition has positive lift.
- NOT (Negation): Accept only if the negated subcondition has non-positive lift.
| Type | Algebra | Acceptance Criterion |
| --- | --- | --- |
| Product | \(c_1 \wedge c_2\) | \(\min_{i\in[1,2]}\text{lift}(x \vert c_i) > 0\) |
| Mixture | \(c_1 \vee c_2\) | \(\max_{i\in[1,2]}\text{lift}(x \vert c_i) > 0\) |
| Negation | \(\neg c_1\) | \(\text{lift}(x \vert c_1) \leq 0\) |

Table 1: Examples of composition rules for multiple conditions.
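As a minimal illustration, the rules in Table 1 reduce to a few comparisons once the per-condition lift scores are computed (e.g., with the `lift_score` sketch above); the function names here are ours, not from the released code:

```python
def accept_product(lifts):    # c1 AND c2 AND ...: every lift must be positive
    return min(lifts) > 0

def accept_mixture(lifts):    # c1 OR c2 OR ...: some lift must be positive
    return max(lifts) > 0

def accept_negation(lift_c):  # NOT c: the condition must have non-positive lift
    return lift_c <= 0

print(accept_product([0.8, 1.2]))   # True: both subconditions satisfied
print(accept_product([0.8, -0.3]))  # False: the second subcondition fails
```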
This compositional approach lets us flexibly combine prompts into logical structures! We call these compositional criteria CompLift. Below we show some examples on 2D synthetic tasks, using Composable Diffusion as the baseline; "+ CompLift" means we use CompLift to reject the Composable Diffusion samples that do not satisfy the criteria.
Product Example

*(Figures: Composable Diffusion vs. + CompLift samples for the product composition.)*

Mixture Example

*(Figures: Composable Diffusion vs. + CompLift samples for the mixture composition.)*

Negation Example

*(Figures: Composable Diffusion vs. + CompLift samples for the negation composition.)*
⚙️ Efficient Implementation
In our paper, we investigate several design choices for CompLift. A TL;DR version (with a code sketch after the list):
- Sharing sampled \((\epsilon, t)\) pairs between the estimation of \(p(x)\) and \(p(x|c)\) is most efficient.
- Importance sampling on \(t\) is required when the original diffusion model \(\epsilon_\theta\) is trained with importance sampling.
- We can reuse the \(\epsilon_\theta(x_t, c)\) values already computed when generating \(x\) to save compute when estimating \(\text{lift}\).
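A hedged sketch of the first and third bullets, reusing the toy interface from `lift_score` above. `cached_eps_cond` is a hypothetical dictionary of conditional predictions saved during the sampling loop; reusing a cached prediction is only exact when its \(x_t\) matches the one being evaluated here.

```python
import torch

def lift_score_shared(eps_model, x0, cond, alphas_cumprod,
                      pairs, cached_eps_cond=None):
    """Lift estimate where p(x) and p(x|c) share the same (eps, t) pairs."""
    total = 0.0
    for t, eps in pairs:                              # shared (eps, t) pairs
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        step = int(t)
        if cached_eps_cond is not None and step in cached_eps_cond:
            eps_c = cached_eps_cond[step]             # reuse work from generation
        else:
            eps_c = eps_model(x_t, t, cond)
        eps_u = eps_model(x_t, t, None)
        total += ((eps - eps_u).pow(2).sum() - (eps - eps_c).pow(2).sum()).item()
    return total / len(pairs)
```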
📝 Connection to Classifier-Free Guidance
One might notice that the above implementation is similar to the classifier-free guidance (CFG) technique: we have \(\epsilon_\theta(x_t, c)\) and compare it with \(\epsilon_\theta(x_t, \varnothing)\). In the appendix of our paper, we show that the two are dual to each other from the Lagrangian perspective (the combined prediction is written out below for concreteness).
CompLift solves the primal problem - the sampler tries to generate samples that satisfy the ground truth condition.
CFG / Composable Diffusion solves the dual problem - the sampler tries to generate samples that are close to the ground truth condition and keeps a balance using the Lagrangian coefficient \(\lambda_i\).
Note:
- CFG / Composable Diffusion often sets the Lagrangian coefficient \(\lambda_i\) as a fixed weight value \(w\).
- The choice of \(p_{\text{generator}}(x_0)\) can vary. CompLift mainly uses \(p_{\text{generator}}(x_0) = p_\theta(x_0 \mid c_{\text{prompt}})\), while CFG / Composable Diffusion often regards \(p_{\text{generator}}(x_0)\) as \(p_\theta(x_0 \mid \varnothing)\).
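For concreteness, the standard CFG / Composable Diffusion combination is built from the same two quantities at sampling time, with \(w_i\) playing the role of the fixed coefficient above:

\[\tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t, \varnothing) + \sum_i w_i \left(\epsilon_\theta(x_t, c_i) - \epsilon_\theta(x_t, \varnothing)\right)\]

CompLift leaves the sampler untouched and instead uses the gap between \(\epsilon_\theta(x_t, c_i)\) and \(\epsilon_\theta(x_t, \varnothing)\) as a post-hoc acceptance test.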
🧪 Results
We evaluate CompLift on:
- 🎨 2D synthetic tasks
- 🧱 CLEVR object positioning
- 🖼 Text-to-image generation (SD 1.4/2.1/XL)
Across the board, CompLift significantly improves prompt alignment without harming quality. More details can be found in our paper.
| Task | Metric | Baseline | +CompLift |
| --- | --- | --- | --- |
| 2D Synthetic Task | Accuracy | 45.0% | 92.9% |
| CLEVR Position (5 objects) | Accuracy | 78.7% | 90.3% |
| SD 2.1 | CLIP Score | 0.342 | 0.352 |
Here we show some examples of samples generated with Stable Diffusion XL. Blue-bordered images are accepted by CompLift, while orange-bordered images are rejected.
*(Figures: Stable Diffusion XL samples for several compositional prompts; blue borders mark accepted samples, orange borders mark rejected ones.)*
In the examples above, CompLift shows great promise in improving prompt alignment. However, some limitations of CompLift are also apparent:
- Since the ELBO estimation is an approximation, the lift score is not always accurate. This leads to some samples being rejected that should be accepted, and vice versa (e.g., the 1st image in "a frog with a bow" and the 2nd image in "a lion with a bow").
- Very small objects tend to be rejected, e.g., the 2nd image in "a red backpack and a yellow bowl".
- Color-based conditions are not always identified effectively, e.g., the 5th image in "an orange backpack and a purple car".
🔬 Lift in the Latent Space
To handle fine-grained image prompts, we compute lift scores per pixel in the latent space, allowing us to detect whether each object is truly present. This also helps us understand which part of the image aligns with each prompt component; a sketch of the per-pixel computation follows.
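A minimal sketch of the per-pixel variant, under the same toy interface as before: instead of summing the squared denoising errors over all dimensions, we keep the spatial axes and threshold at zero. The interface names are illustrative, not from the released code.

```python
import torch

def pixel_lift_map(eps_model, x0, cond, alphas_cumprod, n_samples=32):
    """Per-pixel lift in latent space: positive entries mark 'activated' pixels."""
    lift_map = torch.zeros(x0.shape[-2:])             # one score per latent pixel
    T = alphas_cumprod.shape[0]
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(x0)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        err_u = (eps - eps_model(x_t, t, None)).pow(2)
        err_c = (eps - eps_model(x_t, t, cond)).pow(2)
        lift_map += (err_u - err_c).mean(dim=(0, 1))  # average batch and channels
    lift_map /= n_samples
    return lift_map                                   # object mask: lift_map > 0
```

Thresholding the map at zero gives the masks visualized below: pixels with positive lift are kept, the rest are shown in black.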
Here are some examples showing how lift scores help identify different objects in complex scenes:
*(Figures: per-pixel lift visualizations for different objects in complex scenes.)*
The brighter regions in each visualization indicate where the lift scores are higher for that particular object or concept, showing how CompLift can identify the spatial location of each component in the prompt. Pixels with negative lift scores are masked and shown in black.
📊 Correlation between Lift Scores and CLIP Scores
We investigate the empirical correlation between lift scores and the CLIP scores of generated images. Here we evaluate the CLIP score and the number of activated pixels (pixels with positive lift scores) for each image generated with the prompt "a black car and a white clock".
*(Interactive figure: CLIP score vs. number of activated pixels for each generated sample.)*
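The analysis itself is straightforward to reproduce; here is a hypothetical sketch with placeholder values (the real inputs are the per-image CLIP scores and activated-pixel counts from the experiment above):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder values; in practice, one entry per generated image.
clip_scores = np.array([0.28, 0.31, 0.35, 0.36])   # CLIP score per image
activated = np.array([45, 120, 310, 340])          # pixels with positive lift

r, p = pearsonr(activated, clip_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```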
🧰 Get Started
Want to try it? Check out our GitHub repo!
```bash
git clone https://github.com/rainorangelemon/complift-t2i
cd complift-t2i
python run_lift.py --prompt "a black car and a white clock"
```
🧩 Final Thoughts
CompLift shows how diffusion models can be made compositional—by introspecting their own denoising behavior. It’s simple, elegant, and practical.
While we’ve focused on image generation, the principles behind lift scores may apply to video, music, or even language generation in the future.
📄 Paper
Chenning Yu and Sicun Gao. Improving Compositional Generation with Diffusion Models Using Lift Scores. ICML 2025. [ArXiv] [GitHub]
```bibtex
@inproceedings{yu2025improving,
  title     = {Improving Compositional Generation with Diffusion Models Using Lift Scores},
  author    = {Yu, Chenning and Gao, Sicun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```