DoodlePix - Diffusion-based Drawing Assistant

Fidelity-controlled image generation from doodle inputs using a modified InstructPix2Pix framework.


Aka: draw like a 5-year-old but get great results!




Pipeline
  • Inference: fits in < 4GB
  • Resolution: 512x512px
  • Speed: ~15 steps/second
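
The repository's own scripts define the exact inference entry point; the following is a rough, hypothetical sketch of how a doodle plus an f-token prompt could be run through an InstructPix2Pix-style pipeline in diffusers. The checkpoint path is a placeholder, and the stock pipeline shown here does not apply the FidelityMLP described below.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Placeholder path: the released DoodlePix checkpoint and its fidelity-aware
# pipeline live in the repository.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/DoodlePix-checkpoint", torch_dtype=torch.float16
).to("cuda")

doodle = Image.open("heart_doodle.png").convert("RGB").resize((512, 512))

# The fidelity token f0-f9 leads the prompt, followed by subject and colors.
image = pipe(
    prompt="f7, red heart, white background.",
    image=doodle,
    num_inference_steps=25,
    image_guidance_scale=1.5,   # illustrative value
    guidance_scale=7.5,
).images[0]
image.save("heart_out.png")
```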

Training

Training Loop

The model is trained with the InstructPix2Pix pipeline, modified by the addition of a multilayer perceptron (FidelityMLP). Each training step processes an input_image, an edited_target_image, and a text_prompt carrying an embedded fidelity token f[0-9]. The input image is encoded into latent space by the VAE, the prompt is processed by the CLIP text encoder, and the extracted fidelity value ($F \in [0.0, 0.9]$) is mapped to a corresponding fidelity embedding by the FidelityMLP.
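
The actual FidelityMLP is defined in this repository; purely as an illustration (names, sizes, and the default fidelity are assumptions), the f-token can be parsed from the prompt and mapped by a small MLP into the CLIP text-embedding space:

```python
import re
import torch
import torch.nn as nn

def parse_fidelity(prompt: str) -> tuple[float, str]:
    """Extract a leading f0-f9 token and normalize it to [0.0, 0.9]."""
    match = re.match(r"\s*f([0-9])\s*,?\s*", prompt)
    if match is None:
        return 0.9, prompt  # assumed default when no token is present
    return int(match.group(1)) / 10.0, prompt[match.end():]

class FidelityMLP(nn.Module):
    """Illustrative stand-in: maps the scalar fidelity to one extra token
    in the CLIP text-embedding space (hidden size 1024 for SD 2.1)."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, fidelity: torch.Tensor) -> torch.Tensor:
        # fidelity: (batch,) -> (batch, 1, hidden_size)
        return self.net(fidelity.unsqueeze(-1)).unsqueeze(1)
```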

The core diffusion process trains a U-Net to predict the noise ($\epsilon$) added to the VAE-encoded edited-target latents. The U-Net is conditioned on both the fidelity-injected text embeddings (via cross-attention) and the VAE-encoded input-image (doodle) latents.

The optimization combines two loss terms:

  1. A reconstruction loss ($||\epsilon - \epsilon_\theta||^2$), minimizing the MSE between the sampled noise ($\epsilon$) and the U-Net's predicted noise ($\epsilon_\theta$).
  2. A fidelity-aware L1 loss, calculated on decoded images ($P_{i}$), which balances adherence to the original input ($O_{i}$) and the edited target ($E_{i}$) based on the normalized fidelity value $F$: $F \cdot L1(P_{i}, O_{i}) + (1 - F) \cdot L1(P_{i}, E_{i})$.

The total loss drives gradient updates via an AdamW optimizer, simultaneously training the U-Net and the FidelityMLP.
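
Putting the pieces together, a heavily simplified training step could look like the sketch below. It assumes standard InstructPix2Pix conditioning (doodle latents concatenated channel-wise with the noisy target latents), prepends the fidelity embedding to the text embeddings, and uses an assumed weighting lambda_fid between the two loss terms; unet, vae, text_encoder, noise_scheduler, fidelity_mlp, and optimizer are presumed to be built elsewhere.

```python
import torch
import torch.nn.functional as F

lambda_fid = 1.0  # assumed weighting between the two loss terms

def training_step(batch):
    # Encode the doodle (conditioning) and the edited target into VAE latent space.
    input_latents = vae.encode(batch["input_image"]).latent_dist.mode()
    target_latents = vae.encode(batch["target_image"]).latent_dist.sample()
    target_latents = target_latents * vae.config.scaling_factor

    # Sample noise and timesteps, then noise the target latents.
    noise = torch.randn_like(target_latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (target_latents.shape[0],), device=target_latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(target_latents, noise, timesteps)

    # Fidelity-injected text conditioning (prepending the fidelity token is an assumption).
    text_embeds = text_encoder(batch["prompt_ids"])[0]                # (B, 77, 1024)
    fid_embed = fidelity_mlp(batch["fidelity"])                       # (B, 1, 1024)
    cond_embeds = torch.cat([fid_embed, text_embeds[:, :-1]], dim=1)  # keep length 77

    # InstructPix2Pix-style conditioning: stack noisy target and doodle latents (8 channels).
    unet_input = torch.cat([noisy_latents, input_latents], dim=1)
    noise_pred = unet(unet_input, timesteps, encoder_hidden_states=cond_embeds).sample

    # 1) Reconstruction loss on the predicted noise.
    rec_loss = F.mse_loss(noise_pred, noise)

    # 2) Fidelity-aware L1 loss on decoded images.
    a = noise_scheduler.alphas_cumprod.to(timesteps.device)[timesteps].view(-1, 1, 1, 1)
    pred_latents = (noisy_latents - (1 - a).sqrt() * noise_pred) / a.sqrt()
    pred_images = vae.decode(pred_latents / vae.config.scaling_factor).sample
    fid = batch["fidelity"].view(-1, 1, 1, 1)                         # F in [0.0, 0.9]
    fid_loss = (fid * (pred_images - batch["input_image"]).abs()
                + (1 - fid) * (pred_images - batch["target_image"]).abs()).mean()

    # Total loss updates both the U-Net and the FidelityMLP via AdamW.
    loss = rec_loss + lambda_fid * fid_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```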


Dataset
  • Data Size: ~4.5k images
  • Image Generation: Dalle-3, Flux-Redux-DEV, SDXL
  • Edge Extraction: Canny, Fake Scribble, Scribble XDoG, HED soft edge, Manual (see the Canny sketch below)
  • Doodles were hand-drawn and make up about 20% of the edge inputs
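
The Canny edges can be extracted with a standard OpenCV pass; a minimal sketch follows (the thresholds are illustrative, not the dataset's actual settings):

```python
import cv2

# Illustrative Canny edge extraction; thresholds are assumptions, not the
# values used to build the DoodlePix dataset.
image = cv2.imread("render.png", cv2.IMREAD_GRAYSCALE)
image = cv2.GaussianBlur(image, (3, 3), 0)           # light denoise before edge detection
edges = cv2.Canny(image, threshold1=100, threshold2=200)
edges = 255 - edges                                   # black lines on white, doodle-style
cv2.imwrite("render_edges.png", edges)
```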

Fidelity Embedding in Action

Fidelity values are swept from 0 to 9 while the prompt, seed, and step count are kept constant.

Prompt: f*, red heart, white background.
[Heart results: Image | Normal | 3D | Outline | Flat]

The model also accepts Canny edges as input, and the fidelity injection remains effective.

Prompt: f*, woman, portrait, frame. black hair, pink, black background.
[Woman portrait results: Image | Normal | 3D | Outline | Flat]

More Examples

Prompt: f*, potion, bottle, cork. blue, brown, black background.
Prompt: f*, maul, hammer. gray, brown, white background.
Prompt: f*, torch, flame. red, brown, black background.
[Potion, Maul, and Torch results: Image | Normal]
[Fidelity comparison grids: Input | F0 | F9]

LoRAs

LoRA training allows you to quickly bake a specific style into the model.
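
The repository ships its own LoRA training code; the following is a minimal sketch of the general approach, using peft's LoraConfig on the U-Net attention projections, with an assumed rank, target modules, and checkpoint path:

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Placeholder path and assumed hyperparameters; the repository's LoRA training
# script defines the real configuration.
unet = UNet2DConditionModel.from_pretrained("path/to/DoodlePix-checkpoint", subfolder="unet")
unet.requires_grad_(False)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)

# Only the LoRA weights train; the loop itself is the same fidelity-aware
# training loop described above, run on images of the target style.
lora_params = [p for p in unet.parameters() if p.requires_grad]
```

Because only the adapter weights are optimized, a style LoRA can be trained quickly on a small set of style-specific images.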

[LoRA results: Input | Googh | DontStarve]

LoRA Examples

Googh

LoRAs retain the styles and fidelity injection from DoodlePix.

[Googh LoRA results: Input | Normal | 3D | Outline | Flat | Low Fidelity | High Fidelity]

More Examples:

[More Googh results: Input | Normal | 3D | Outline | Flat]

DontStarve

[DontStarve LoRA results for Flower, Gift, Carrot, Rope, Potato, Heart, Axe, Potion, and Torch: Input | style variations]

The model shows great color understanding.

Prompt: f9, flower, stylized. *color, green, white
[Flower color results: Input | red | blue | purple | green | cyan | yellow | orange]

Limitations
  • The model was trained mainly on objects and items, i.e. things rather than characters.
  • It inherits most of the limitations of the Stable Diffusion 2.1 base model.

Reasoning

The objective is to train a model able to take drawings as inputs.

While most models and ControlNets were trained on Canny or similar line extractors as inputs (which focus on the most prominent lines in an image), drawings are made with intention. A few squiggly lines placed in the right spots can sometimes convey a much better idea of what the image represents:

[Comparison: Drawing | Canny]

Although the InstructPix2Pix pipeline supports an ImageGuidance factor to control adherence to the input image, it tends to follow the drawing too strictly at higher values while losing compositional nuances at lower values.

TODOs

DATA
  • [ ] Increase amount of hand-drawn line inputs
  • [X] Smaller-Bigger subject variations
  • [ ] Background Variations
  • [ ] Increase Flat style references
  • [ ] Improve color matches in prompts
  • [ ] Clean up
Training
  • [X] Release V1
  • [ ] Release DoodleCharacters (DoodlePix but for characters)
  • [X] Release Training code
  • [X] Release Lora Training code

Credits