DoodlePix
Diffusion-based Drawing Assistant
Aka: Draw like a 5-year-old but get great results!
Pipeline
- Inference: fits in < 4GB
- Resolution: 512x512px
- Speed: ~15 steps/second
Training
- Base Model: Stable Diffusion 2.1
- Training Requirements: < 14GB
- Setup: NVIDIA RTX 4070
The model is trained with the InstructPix2Pix pipeline, modified by the addition of a Multilayer Perceptron (FidelityMLP). The training loop processes an input_image, an edited_target_image, and a text_prompt carrying an embedded fidelity token f[0-9].
Input images are encoded into the latent space (VAE encoding), the prompt is processed by a CLIP text encoder, and the extracted fidelity value ($F \in [0.0, 0.9]$) is mapped to a corresponding fidelity embedding through the FidelityMLP.
The core diffusion process trains a U-Net to predict the noise ($\epsilon$) added to the VAE-encoded edited target latents. The U-Net is conditioned on both the fidelity-injected text embeddings (via cross-attention) and the VAE-encoded input image (doodle) latents.
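For illustration, here is a minimal sketch of how a scalar fidelity value could be mapped to an embedding and injected into the CLIP text embeddings. The class name FidelityMLP matches the description above, but the layer sizes, injection scheme, and helper names are assumptions rather than the repo's actual implementation.

```python
import torch
import torch.nn as nn

class FidelityMLP(nn.Module):
    """Maps a scalar fidelity value F in [0.0, 0.9] to an embedding vector.
    Layer sizes here are illustrative assumptions, not the repo's config."""
    def __init__(self, hidden_dim: int = 256, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, fidelity: torch.Tensor) -> torch.Tensor:
        # fidelity: (batch,) -> (batch, embed_dim)
        return self.net(fidelity.unsqueeze(-1))

def inject_fidelity(text_embeds: torch.Tensor, fidelity_embed: torch.Tensor) -> torch.Tensor:
    """Adds the fidelity embedding to every token of the CLIP text embeddings
    (one plausible injection scheme; the actual scheme may differ)."""
    return text_embeds + fidelity_embed.unsqueeze(1)

# Example: batch of 2 prompts with fidelity tokens f3 and f9
fidelity_mlp = FidelityMLP(embed_dim=1024)   # SD 2.1 text embeddings are 1024-dim
text_embeds = torch.randn(2, 77, 1024)       # stand-in for CLIP text encoder output
F = torch.tensor([0.3, 0.9])                 # parsed from the "f3" / "f9" prefixes
conditioned = inject_fidelity(text_embeds, fidelity_mlp(F))
print(conditioned.shape)  # torch.Size([2, 77, 1024])
```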
The optimization combines two loss terms:
- A reconstruction loss ($||\epsilon - \epsilon_\theta||^2$), minimizing the MSE between the sampled noise ($\epsilon$) and the U-Net's predicted noise ($\epsilon_\theta$).
- A fidelity-aware L1 loss, calculated on decoded images ($P_{i}$), which balances adherence to the original input ($O_{i}$) and the edited target ($E_{i}$) based on the normalized fidelity value $F$: $F \cdot L1(P_{i}, O_{i}) + (1 - F) \cdot L1(P_{i}, E_{i})$.
The total loss drives gradient updates via an AdamW optimizer, simultaneously training the U-Net and the FidelityMLP.
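A hedged sketch of the combined objective described above; the tensor names, the l1_weight balance factor, and the decoded-image inputs are illustrative assumptions, not the exact training code.

```python
import torch.nn.functional as F

def doodlepix_loss(noise, noise_pred, decoded_pred, input_image, target_image,
                   fidelity, l1_weight=1.0):
    """Two-term objective: noise-prediction MSE plus a fidelity-aware L1 on
    decoded images. `fidelity` is the normalized value F in [0.0, 0.9]."""
    # Reconstruction term: ||eps - eps_theta||^2
    rec_loss = F.mse_loss(noise_pred, noise)

    # Fidelity-aware term: F * L1(pred, original input) + (1 - F) * L1(pred, edited target)
    l1_to_input = (decoded_pred - input_image).abs().mean(dim=(1, 2, 3))
    l1_to_target = (decoded_pred - target_image).abs().mean(dim=(1, 2, 3))
    fid_loss = (fidelity * l1_to_input + (1.0 - fidelity) * l1_to_target).mean()

    return rec_loss + l1_weight * fid_loss

# Both the U-Net and the FidelityMLP receive gradients from this total loss, e.g.:
# optimizer = torch.optim.AdamW(
#     list(unet.parameters()) + list(fidelity_mlp.parameters()),
#     lr=1e-4,  # learning rate is an assumption
# )
```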
Dataset
- Data Size: ~4.5k images
- Image Generation: Dalle-3, Flux-Redux-DEV, SDXL
- Edge Extraction: Canny, Fake Scribble, Scribble XDoG, HED soft edge, Manual (a minimal Canny example follows this list)
- Doodles were hand-drawn and make up about 20% of the edge inputs
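For reference, a minimal sketch of the simplest extractor listed above (Canny via OpenCV); the threshold values are arbitrary examples, and the other extractors (HED, the scribble variants, XDoG) were run through the ComfyUI ControlNet Aux nodes mentioned in the Credits.

```python
import cv2

# Minimal Canny edge extraction; thresholds are example values,
# not the ones used to build the dataset.
image = cv2.imread("source.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(image, threshold1=100, threshold2=200)
# Invert if the training convention expects dark lines on a light background:
# edges = cv2.bitwise_not(edges)
cv2.imwrite("edges.png", edges)
```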
Fidelity Embedding in Action
The examples below vary the fidelity value from 0 to 9 while keeping the prompt, seed, and steps constant; a minimal inference sketch follows the first table.
| Prompt: f*, red heart, white background. | | | | |
|---|---|---|---|---|
| Image![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
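As an illustration of how the fidelity token is used at inference time, here is a hedged sketch built on the standard diffusers InstructPix2Pix pipeline; DoodlePix uses a custom pipeline implementation (see Credits), so the checkpoint path, pipeline class, and file names below are assumptions.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Hypothetical checkpoint path; the real DoodlePix pipeline may differ.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/DoodlePix", torch_dtype=torch.float16
).to("cuda")

doodle = Image.open("heart_doodle.png").convert("RGB").resize((512, 512))

# Same seed and step count; only the fidelity token changes (f0 = loose, f9 = strict).
for f in (0, 4, 9):
    generator = torch.Generator("cuda").manual_seed(42)
    result = pipe(
        prompt=f"f{f}, red heart, white background.",
        image=doodle,
        num_inference_steps=15,
        generator=generator,
    ).images[0]
    result.save(f"heart_f{f}.png")
```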
- The model also accepts Canny edges as input, and fidelity injection remains effective.
| Prompt: f*, woman, portrait, frame. black hair, pink, black background. | | | | |
|---|---|---|---|---|
| Image![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
More Examples
| Prompt: f*, potion, bottle, cork. blue, brown, black background. | Prompt: f*, maul, hammer. gray, brown, white background. | Prompt: f*, torch, flame. red, brown, black background. | | | |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| input![]() | F0![]() | F9![]() | input![]() | F0![]() | F9![]() |
LoRAs
LoRA training allows you to quickly bake a specific style into the model.
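As a rough sketch, a style LoRA could be applied on top of the pipeline with diffusers' standard LoRA loader; the paths, adapter name, and prompt below are placeholders, and the repo's own LoRA training/loading scripts may differ.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Placeholder paths and names; adjust to the actual checkpoint and LoRA files.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/DoodlePix", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/googh_lora.safetensors", adapter_name="googh")

doodle = Image.open("flower_doodle.png").convert("RGB").resize((512, 512))
styled = pipe(
    prompt="f5, flower, vase. yellow, green, blue background.",
    image=doodle,
    num_inference_steps=15,
).images[0]
styled.save("flower_googh.png")
```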
| input![]() | Googh![]() | DontStarve![]() |
|---|---|---|
| input![]() | Googh![]() | DontStarve![]() |
LoRA Examples
Googh
LoRAs retain the style prompts and fidelity injection from DoodlePix.
| input![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
|---|---|---|---|---|
| Low Fidelity![]() | High Fidelity![]() | | | |
| input![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
More Examples:
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
DontStarve
| Flower![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| Gift![]() | ![]() | ![]() | ![]() | ![]() |
| Carrot![]() | ![]() | ![]() | ![]() | ![]() |
| Rope![]() | ![]() | ![]() | ![]() | ![]() |
| Potato![]() | ![]() | ![]() | ![]() | ![]() |
| Heart![]() | ![]() | ![]() | ![]() | ![]() |
| Axe![]() | ![]() | ![]() | ![]() | ![]() |
| Potion![]() | ![]() | ![]() | ![]() | ![]() |
| Torch![]() | ![]() | ![]() | ![]() | ![]() |
The model shows great color understanding.
| Prompt: f9, flower, stylized. *color, green, white | | | |
|---|---|---|---|
| input![]() | red![]() | blue![]() | purple![]() |
| green![]() | cyan![]() | yellow![]() | orange![]() |
Limitations
- The model was trained mainly on objects and items: things rather than characters.
- It inherits most of the limitations of the Stable Diffusion 2.1 base model.
Reasoning
The objective is to train a model that can take drawings as input.
While most models and ControlNets were trained on Canny or similar line extractors (which focus on the most prominent lines in an image), drawings are made with intention. A few squiggly lines placed in the right spot can often convey a much better idea of what the image represents:
| Drawing![]() | Canny![]() |
|---|---|
Although the InstructPix2Pix pipeline supports an image guidance scale to control adherence to the input image, it tends to follow the drawing too strictly at higher values while losing compositional nuance at lower values.
TODOs
DATA
- [ ] Increase amount of hand-drawn line inputs
- [X] Smaller-Bigger subject variations
- [ ] Background Variations
- [ ] Increase Flat style references
- [ ] Improve color matches in prompts
- [ ] Clean up
Training
- [X] Release V1
- [ ] Release DoodleCharacters (DoodlePix but for characters)
- [X] Release Training code
- [X] Release Lora Training code
Credits
- This is a custom implementation of the training and pipeline scripts from the Diffusers repo
- The dataset was generated using chat-based DALL-E 3, SDXL, and FLUX-REDUX-DEV
- Edge extraction was made easy thanks to Fannovel16's ComfyUI Controlnet Aux
- ComfyUI was a big part of the Data Development process
- Around 30% of the images were captioned using Moondream2
- Dataset Handlers were built using PyQT
- Huge Thanks to the OpenSource community for hosting and sharing so much cool stuff