We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale.
Our approach follows a well-structured “Chain-of-Edit” pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments.
Our benchmark yields several notable insights into how current editing models behave as instruction complexity increases.
The Chain-of-Edit pipeline constructs each complex instruction in three stages, sketched in the code below.
A sequence of atomic editing instructions is produced for each input image.
Each atomic instruction is refined to eliminate extraneous details, preserving only the essential description of the editing process.
Several atomic instructions are integrated into one comprehensive instruction.
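To make these three stages concrete, here is a minimal Python sketch of the generation flow, assuming the OpenAI Python client as the GPT-4o interface. The prompts, function names, and JSON formats are illustrative assumptions, not the exact ones used to build Complex-Edit.

```python
# A sketch of the three generation stages, assuming the OpenAI Python client
# (openai>=1.0) as the GPT-4o interface. Prompts and function names are
# illustrative, not the exact ones used to build Complex-Edit.
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_gpt4o(prompt: str, image_b64: str | None = None, force_json: bool = False) -> str:
    """Send one prompt (optionally with an image) to GPT-4o and return the reply text."""
    content = [{"type": "text", "text": prompt}]
    if image_b64 is not None:
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        )
    kwargs = {"response_format": {"type": "json_object"}} if force_json else {}
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        **kwargs,
    )
    return resp.choices[0].message.content


def generate_atomic_instructions(image_path: str, n: int = 8) -> list[str]:
    """Stage 1: propose n independent atomic editing instructions for one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    raw = ask_gpt4o(
        f"Propose {n} independent atomic editing instructions for this image. "
        'Reply with a JSON object of the form {"instructions": [...]}.',
        image_b64,
        force_json=True,
    )
    return json.loads(raw)["instructions"]


def simplify(instruction: str) -> str:
    """Stage 2: strip extraneous detail, keeping only the essential edit description."""
    return ask_gpt4o(
        "Rewrite this editing instruction so that it keeps only the essential "
        f"description of the edit:\n{instruction}"
    )


def compound(instructions: list[str], complexity: int) -> str:
    """Stage 3: merge the first `complexity` atomic edits into one cohesive instruction."""
    joined = "\n".join(instructions[:complexity])
    return ask_gpt4o(
        "Combine the following atomic editing instructions into a single, "
        f"cohesive, complex instruction:\n{joined}"
    )


# Example: build one complex instruction of complexity 4 for an image.
# atomic = [simplify(i) for i in generate_atomic_instructions("input.jpg")]
# complex_instruction = compound(atomic, complexity=4)
```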
The VLM-based auto-evaluation scores each edited image along three aspects, illustrated in the sketch after this list.
Whether the specified modifications are present in the output image.
Whether elements of the input image that should remain unchanged are indeed preserved.
Aesthetic factors such as consistency in lighting and shadows, style coherence, and the seamless integration of elements.
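For illustration, here is a minimal sketch of how such a VLM-based scoring pass might look, again assuming GPT-4o (via the OpenAI Python client) as the judge. The metric names, 0-10 scale, and rubric wording are assumptions made for this example, not the benchmark's exact evaluation protocol.

```python
# A sketch of a VLM-as-judge scoring pass over one edit, assuming the OpenAI
# Python client and GPT-4o as the judge. The metric names, 0-10 scale, and
# rubric wording are illustrative assumptions, not the exact Complex-Edit rubric.
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def _b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def judge_edit(input_path: str, output_path: str, instruction: str) -> dict:
    """Score one edited image on the three aspects described above."""
    rubric = (
        "You are shown an input image, then an edited output image, for the "
        f"editing instruction: '{instruction}'.\n"
        "Score the edit from 0 to 10 on each aspect and reply with a JSON object "
        "with keys instruction_following, identity_preservation, perceptual_quality:\n"
        "- instruction_following: are all requested modifications present?\n"
        "- identity_preservation: are regions that should stay unchanged preserved?\n"
        "- perceptual_quality: are lighting, shadows, style, and composition coherent?"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": rubric},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{_b64(input_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{_b64(output_path)}"}},
            ],
        }],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)


# Example: scores = judge_edit("input.jpg", "edited.jpg", complex_instruction)
```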
@misc{yang2025textttcomplexeditcotlikeinstructiongeneration,
  title={$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark},
  author={Siwei Yang and Mude Hui and Bingchen Zhao and Yuyin Zhou and Nataniel Ruiz and Cihang Xie},
  year={2025},
  eprint={2504.13143},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.13143},
}