Complex-Edit

CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang1, Mude Hui1, Bingchen Zhao2, Yuyin Zhou1, Nataniel Ruiz3, Cihang Xie1
1UC Santa Cruz 2University of Edinburgh 3Google

An illustration of our Complex-Edit Benchmark. This figure presents a structured progression of instruction complexity in image editing tasks, highlighting the transition from atomic edits to highly intricate transformations.

Abstract

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale.

Our approach follows a well-structured “Chain-of-Edit” pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments.

Our benchmark yields several notable insights:

  1. Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases;
  2. Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality;
  3. Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics;
  4. A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach;
  5. We observe a “curse of synthetic data”: when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that, intriguingly, also manifests in the latest GPT-4o outputs.

Dataset
An overview of our data generation pipeline.
Stage #1
Sequence Generation

A sequence of atomic instructions is produced for each image.

Stage #2
Simplification

Each atomic instruction is refined to eliminate extraneous details, preserving only the essential description of the editing process.

Stage #3
Instruction Compounding

Several atomic instructions are integrated into one comprehensive instruction.
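
To make the pipeline concrete, below is a minimal sketch of the three stages, assuming access to GPT-4o through the OpenAI Chat Completions API. The prompts, helper names, and step count are illustrative placeholders, not the exact prompts used to build the benchmark; we also assume here that complexity level \( C_n \) corresponds to compounding n atomic edits.

# Hypothetical sketch of the three-stage data generation pipeline.
# Prompts and helper names are illustrative, not the benchmark's exact prompts.
from openai import OpenAI

client = OpenAI()

def ask_gpt4o(prompt: str) -> str:
    # Single-turn text query to GPT-4o.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def generate_sequence(image_description: str, n_steps: int) -> list[str]:
    # Stage #1: propose a sequence of atomic editing instructions for one image.
    reply = ask_gpt4o(
        f"Given an image described as: {image_description}\n"
        f"Propose {n_steps} atomic image-editing instructions, one per line."
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def simplify(instruction: str) -> str:
    # Stage #2: strip extraneous detail, keeping only the essential edit.
    return ask_gpt4o(
        "Rewrite this editing instruction to keep only the essential "
        f"description of the edit:\n{instruction}"
    )

def compound(instructions: list[str]) -> str:
    # Stage #3: merge several atomic instructions into one complex instruction.
    bullet_list = "\n".join(f"- {inst}" for inst in instructions)
    return ask_gpt4o(
        "Combine these atomic editing instructions into a single cohesive, "
        f"complex instruction:\n{bullet_list}"
    )

atomic = [simplify(s) for s in generate_sequence("a photo of a busy city street", 8)]
c8_instruction = compound(atomic)  # assumed C_8: all eight atomic edits merged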

Evaluation
Instruction Following

Whether the specified modifications are present in the output image.

Identity Preservation

Whether elements of the input image that should remain unchanged are indeed preserved.

Perceptual Quality

Aesthetic factors such as consistency in lighting and shadows, style coherence, and the seamless integration of elements.

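
Below is a minimal sketch of how such a VLM-based judge can be queried, assuming GPT-4o as the judge via the OpenAI vision API. The rubric wording and the 0 to 10 scale are illustrative placeholders rather than the exact evaluation protocol.

# Hypothetical sketch of one VLM-as-judge call. The rubric prompt and
# 0-10 scale are illustrative, not the paper's exact protocol.
import base64
from openai import OpenAI

client = OpenAI()

METRICS = ["Instruction Following", "Identity Preservation", "Perceptual Quality"]

def encode_image(path: str) -> str:
    # Base64-encode an image file for the vision API.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def judge(input_path: str, output_path: str, instruction: str, metric: str) -> str:
    # Ask GPT-4o to rate one metric for an (input, output, instruction) triple.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"The first image was edited into the second using the "
                    f"instruction: '{instruction}'. Rate the edit on "
                    f"{metric} from 0 to 10 and reply with the number only."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(input_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(output_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

scores = {m: judge("input.png", "edited.png", "add falling snow", m) for m in METRICS}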
Results
Effect of Increasing Complexity
Qualitative results with editing instructions at different complexity levels.
Higher complexity leads to a consistent decline in both Identity Preservation and Perceptual Quality across models, while Instruction Following tends to fluctuate depending on the specific model.
GPT-4o surpasses all the other models we tested on every metric, particularly Instruction Following and Perceptual Quality. However, images edited by GPT-4o completely lose their realistic style under highly complex instructions. See the following section for more details.
Results of GPT-4o - Curse of Synthetic Data
Qualitative results with realistic image inputs edited by GPT-4o based on instructions at \( C_8 \) complexity level.
Evaluation results with direct editing on real images with the instruction complexity at \( C_8 \).
When applying extremely complex editing instructions (\( C_8 \)) to real input images, the resulting outputs frequently lose their realistic appearance and adopt a distinctly synthetic aesthetic. Notably, UltraEdit is particularly susceptible to this effect compared to OmniGen and AnyEdit.
We attribute this trend to the composition of UltraEdit's training data, which contains a significantly higher proportion of synthetic images than the datasets used by OmniGen and AnyEdit.
This also applies to powerful models like GPT-4o. As shown in the figure above, GPT-4o's edited images lose their realistic style even though it surpasses all the other models we tested on all three metrics. This suggests that GPT-4o's training heavily relies on synthetic data.
Sequential Editing
Qualitative results with instructions at different complexity levels.
Sequential editing refers to decomposing a complex instruction into a sequence of atomic instructions and then executing them one at a time in a CoT-like style, as sketched below.
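
A minimal sketch of this procedure, where edit_model is a placeholder for any instruction-based editor rather than a specific model's API:

# Hypothetical sketch of sequential (CoT-like) editing.
from PIL import Image

def edit_model(image: Image.Image, instruction: str) -> Image.Image:
    # Placeholder: wrap any instruction-based editing model here.
    raise NotImplementedError

def sequential_edit(image: Image.Image, atomic_steps: list[str]) -> Image.Image:
    # Apply the decomposed atomic instructions one at a time; each step
    # re-edits the previous output, so artifacts can accumulate.
    current = image
    for step in atomic_steps:
        current = edit_model(current, step)
    return current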
Our results reveal that sequential editing yields a steady decline in performance across all three metrics, with accumulating visual artifacts and distortions, even for strong proprietary models such as Imagen3 and SeedEdit.
Best-of-N
Qualitative results of sequential editing with and without Best-of-4 with OmniGen on real input images.
For sequential editing, a Best-of-N strategy produces significant gains in Identity Preservation and Perceptual Quality; however, the improvement in Instruction Following is less consistent.
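
A minimal sketch of Best-of-N on top of the placeholders above; score_edit stands in for the VLM judge, and aggregating the three metrics into one scalar is an illustrative choice, not the paper's exact selection rule.

# Hypothetical sketch of Best-of-N selection, reusing edit_model from above.
def score_edit(original, candidate, instruction: str) -> float:
    # Placeholder: return a scalar VLM-judge score for one candidate edit,
    # e.g. the mean of Instruction Following, Identity Preservation, and
    # Perceptual Quality.
    raise NotImplementedError

def best_of_n(original, instruction: str, n: int = 4):
    # Sample n candidate edits and keep the one the judge scores highest.
    candidates = [edit_model(original, instruction) for _ in range(n)]
    return max(candidates, key=lambda c: score_edit(original, c, instruction))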

BibTeX

@misc{yang2025textttcomplexeditcotlikeinstructiongeneration,
    title={$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark}, 
    author={Siwei Yang and Mude Hui and Bingchen Zhao and Yuyin Zhou and Nataniel Ruiz and Cihang Xie},
    year={2025},
    eprint={2504.13143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2504.13143}, 
}

Acknowledgments

We would like to thank the Google Cloud Research Credits Program and the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.