CompCon: Discovering Divergent Representations between Text-to-Image Models

CompCon teaser showing divergent representations discovery

Discovering divergent representations with CompCon. Left: CompCon takes a pair of text-to-image models as input and outputs a diverging prompt description together with a diverging visual attribute that appears in one model's images but not the other's. Right: We show the discovered diverging visual attribute 'flames' appearing in PixArt but not SDXL-Lightning across different diverging prompts.

Abstract

In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, 'flames' might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID², a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions.

Method Overview

CompCon method overview

CompCon overview. We illustrate our approach for discovering diverging visual attributes (top) and diverging prompt descriptions (bottom). Given two text-to-image models and a set of prompts, we use a VLM to identify visual differences. For each diverging attribute, we iteratively refine diverging prompt descriptions by generating candidate prompts and classifying them as diverging or non-diverging.
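The refinement loop described above can be sketched in code. This is a minimal illustration, not the paper's implementation: all helpers below (`model_a`, `model_b`, `has_attr`, `propose_prompts`) are hypothetical toy stand-ins for the text-to-image models, the VLM judge, and the LLM prompt generator that the real system calls.

```python
def attribute_gap(prompts, model_a, model_b, has_attr, attribute):
    """Fraction of prompts whose model-A image shows the attribute
    while the model-B image does not (higher = more divergent)."""
    hits = sum(
        has_attr(model_a(p), attribute) and not has_attr(model_b(p), attribute)
        for p in prompts
    )
    return hits / len(prompts)

def refine_description(candidates, propose_prompts, model_a, model_b,
                       has_attr, attribute, iters=5):
    """Keep the candidate description whose generated prompts best
    separate the two models on the target visual attribute."""
    best, best_score = None, -1.0
    for _ in range(iters):
        for desc in candidates:
            prompts = propose_prompts(desc)  # an LLM in the real system
            score = attribute_gap(prompts, model_a, model_b,
                                  has_attr, attribute)
            if score > best_score:
                best, best_score = desc, score
    return best, best_score

# --- purely illustrative mocks ---
def model_a(prompt):             # pretend model A adds flames to lonely prompts
    return {"flames"} if "lonely" in prompt else set()

def model_b(prompt):             # pretend model B never adds flames
    return set()

def has_attr(image, attribute):  # pretend VLM judge over a set of tags
    return attribute in image

def propose_prompts(desc):       # pretend LLM prompt generator
    if "lonely" in desc:
        return ["a lonely scene", "a lonely person on an empty street"]
    return [f"a {desc} scene"]

best, score = refine_description(
    ["lonely", "cheerful"], propose_prompts, model_a, model_b,
    has_attr, "flames")
# best == "lonely": only lonely-themed prompts trigger the attribute gap
```

In the real pipeline, each of the mocked calls is replaced by an actual model query, so the same scoring logic drives the evolutionary search over prompt descriptions.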

Dataset Creation

ID² (Input-Dependent Differences) Dataset

We created ID², a benchmark dataset of 60 divergent representations between text-to-image models. Since the true differences between real model pairs aren't known beforehand, we simulate a divergent pair: a single text-to-image model plays both roles, and for one role we modify its prompts to mention a specific visual attribute.

ID² creation process: Given a diverging prompt description and diverging visual attribute, we use an LLM to generate prompt pairs where one of the prompts mentions the diverging visual attribute. Both prompts are then passed to the same text-to-image model to generate image pairs with the visual difference.

ID² dataset creation process
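The prompt-pair construction above can be sketched as follows. The helper name and example prompts are hypothetical; in the actual pipeline an LLM writes the pairs, and both prompts are rendered by the same text-to-image model so the injected attribute is the only difference.

```python
def make_prompt_pairs(base_prompts, attribute):
    """For each base prompt matching the diverging prompt description,
    build a pair: the original prompt and a copy mentioning the
    diverging visual attribute."""
    return [(p, f"{p}, with {attribute}") for p in base_prompts]

pairs = make_prompt_pairs(
    ["a window at dusk", "an office at noon"], "venetian blinds")
# Each pair is then rendered by the SAME text-to-image model, yielding
# image pairs whose only difference is the injected visual attribute.
```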

Results

Benchmark Results

ID² Benchmark Performance

| Metric            | Method           | Top-1 | Top-5 |
|-------------------|------------------|-------|-------|
| Attribute Score   | CompCon          | 0.60  | 0.68  |
|                   | VisDiff          | 0.47  | 0.62  |
|                   | TF-IDF           | 0.23  | 0.37  |
|                   | LLM-only         | 0.08  | 0.24  |
| Description Score | CompCon [5-iter] | 0.64  | 0.78  |
|                   | CompCon [1-iter] | 0.59  | 0.72  |
|                   | TF-IDF           | 0.40  | 0.57  |
|                   | LLM-only         | 0.03  | 0.28  |
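A generic Top-k hit-rate of the kind reported above can be sketched as follows. This is an assumption about the metric's shape, not the paper's exact scoring code, and the match function here is a naive exact-string comparison standing in for whatever similarity judgment the benchmark uses; all data below is made up for illustration.

```python
def top_k_score(ranked_predictions, ground_truth, matches, k):
    """1.0 if any of the top-k ranked predictions matches the
    ground-truth attribute or description, else 0.0."""
    return float(any(matches(p, ground_truth)
                     for p in ranked_predictions[:k]))

def benchmark_score(cases, k):
    """Mean top-k hit rate over benchmark cases, where each case is
    (ranked_predictions, ground_truth)."""
    return sum(top_k_score(preds, gt, lambda a, b: a == b, k)
               for preds, gt in cases) / len(cases)

# Toy cases: the second ground truth only appears at rank 3,
# so it counts for Top-5 but not Top-1.
cases = [(["flames", "smoke"], "flames"),
         (["wet streets", "rain", "fog"], "fog")]
```

Under this reading, a method's Top-5 score can only meet or exceed its Top-1 score, which matches the pattern in the table.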
ID² benchmark example showing Venetian blinds divergent representation

Model Comparisons: PixArt vs SD-Lightning

Our method reveals systematic differences in how PixArt and SD-Lightning interpret and visualize the same concepts, exposing distinct model biases and representational patterns.

Bias Detection

CompCon can automatically discover various types of bias in text-to-image models, including racial, age, and gender biases in professional contexts.

Citation

@inproceedings{dunlap2025compcon,
  title={Discovering Divergent Representations between Text-to-Image Models},
  author={Dunlap, Lisa and Gonzalez, Joseph E. and Darrell, Trevor and Caba Heilbron, Fabian and Sivic, Josef and Russell, Bryan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}