DiffDoctor: Diagnosing Image Diffusion Models Before Treating

Yiyang Wang1   Xi Chen1   Xiaogang Xu4   Sihui Ji1   Yu Liu2   Yujun Shen3   Hengshuang Zhao1  
1The University of Hong Kong   2Tongyi Lab   3Ant Financial Services Group   4Zhejiang University

Image diffusion models inevitably generate artifacts. We train a robust artifact detector to diagnose these artifacts and then use it to treat the diffusion model. After treatment on a limited set of prompts, the diffusion model generates fewer artifacts of similar types on unseen prompts while maintaining image quality.

Diagnose: Detect Artifacts


We compare our best artifact detector with an artifact detector trained on previously collected data (Prev.). The input images include real photos, images synthesized by SD1.5, and images synthesized by FLUX.1.

Treat: Tune Diffusion Models


DiffDoctor on FLUX.1. All images are synthesized from randomly generated, unseen prompts that were not involved in training, using the same seeds for corresponding images before and after DiffDoctor. After treatment, artifacts in the images are reduced while their content and layout remain almost unchanged, demonstrating the effectiveness of pixel-aware treating.

DiffDoctor on DreamBooth


We perform DiffDoctor on the instance prompt and evaluate on various other prompts, following the DreamBooth protocol.

DiffDoctor vs Hand-Fix LoRA


We compare DiffDoctor with Hands-XL on the same prompts and seeds.

Abstract

In spite of recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we believe problem-solving starts with identification: the model should be aware not only of the presence of defects in an image, but also of their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline that helps image diffusion models generate fewer artifacts. Concretely, the first stage develops a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then used in the second stage to tune the diffusion model by assigning a per-pixel artifact confidence map to each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.

Overall Pipeline

Overall Pipeline

Pipeline of DiffDoctor. The first part shows the training of the artifact detector (the doctor). Starting from an initial dataset, the detector is trained in a human-in-the-loop manner. The second part shows our diagnose-then-treat design, where the patient, a trainable diffusion model, is prompted to synthesize images. The frozen artifact detector then diagnoses each result by predicting an artifact map, and treats the patient by minimizing the per-pixel artifact confidence, back-propagating the gradients into the diffusion model.
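The treat stage described above can be illustrated with a minimal sketch: a frozen detector scores each pixel of a synthesized image, and the mean artifact confidence is minimized by back-propagating only into the generator. Both modules below are tiny hypothetical stand-ins (a single conv "generator" and a conv+sigmoid "detector"), not the authors' actual architectures.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "patient": a trainable module that produces an image.
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Stand-in "doctor": a frozen detector mapping an image to a
# 1-channel per-pixel artifact confidence map in [0, 1].
detector = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1), nn.Sigmoid())
for p in detector.parameters():
    p.requires_grad_(False)  # the doctor stays frozen during treatment

noise = torch.randn(1, 3, 64, 64)
image = generator(noise)          # diagnose: synthesize an image...
artifact_map = detector(image)    # ...and predict its artifact map

# Treat: minimize the mean per-pixel artifact confidence; gradients
# flow through the frozen detector into the generator only.
loss = artifact_map.mean()
loss.backward()

print(generator.weight.grad is not None)                    # True
print(all(p.grad is None for p in detector.parameters()))   # True
```

The key design point mirrored here is that freezing the detector turns it into a fixed, pixel-aware critic: the loss localizes which pixels are flawed, so the generator is nudged to remove artifacts without being pushed to change the overall content or layout.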

BibTeX


      @article{wang2025diffdoctor,
        title={DiffDoctor: Diagnosing Image Diffusion Models Before Treating},
        author={Wang, Yiyang and Chen, Xi and Xu, Xiaogang and Ji, Sihui and Liu, Yu and Shen, Yujun and Zhao, Hengshuang},
        journal={arXiv preprint arXiv:2501.12382},
        year={2025}
      }