Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Abstract
Meta-CoT enhances image editing by decomposing editing operations into task-target-understanding triplets and fundamental meta-tasks, improving both granularity and generalization through CoT-editing consistency rewards.
Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: which forms of CoT and which training strategies can jointly enhance both understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations over all targets. This decomposition sharpens the model's understanding of editing operations and guides it to learn each element of the triplet during training, substantially improving editing capability. (2) Generalizability. At the second decomposition level, we further break editing tasks down into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective use of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
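The first-level decomposition described above can be sketched as a small data structure. This is a minimal illustrative sketch, not the paper's implementation: the class and function names are hypothetical, and the five meta-task labels are placeholders, since the abstract does not enumerate the actual taxonomy.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Placeholder labels; the paper's actual five meta-tasks are not
# enumerated in the abstract.
META_TASKS = ["add", "remove", "replace", "attribute_change", "move"]

@dataclass
class EditTriplet:
    task: str           # the editing operation (mapped to a meta-task)
    target: str         # the image region/object the edit applies to
    understanding: str  # the understanding ability the edit requires

def decompose(edits: List[Tuple[str, List[str], str]]) -> List[EditTriplet]:
    """First-level decomposition: emit one triplet per (task, target)
    pair, traversing the editing operation over every affected target."""
    triplets = []
    for task, targets, ability in edits:
        for target in targets:
            triplets.append(EditTriplet(task, target, ability))
    return triplets
```

Under this sketch, an instruction touching two objects yields two triplets, e.g. `decompose([("remove", ["cat", "dog"], "object grounding")])`, so the model can be supervised on each element of each triplet separately.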
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Generative Visual Chain-of-Thought for Image Editing (2026)
- EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization (2026)
- ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning (2026)
- HP-Edit: A Human-Preference Post-Training Framework for Image Editing (2026)
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning (2026)
- ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework (2026)
- UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs (2026)
The two-level decomposition in Meta-CoT, especially the (task, target, required understanding) triplet, is a neat way to push chain-of-thought reasoning into the actual edit plan. My worry is that it presumes targets can be cleanly separated and edited independently, whereas many real edits couple targets in ways that change geometry, lighting, and context simultaneously. A formal ablation comparing triplet reasoning with explicit target granularity against a more holistic treatment of targets, on tasks with strong cross-target dependencies, would be really telling. The ArxivLens breakdown helped me parse the method details and is a solid companion to the paper: https://arxivlens.com/PaperView/Details/meta-cot-enhancing-granularity-and-generalization-in-image-editing-9906-7fb3fa30. Overall I like the direction and the reported gains, but edge-case robustness will decide whether this scales to messier edits.
Get this paper in your agent:
hf papers read 2604.24625
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 0