DoodlePix
Diffusion-based Drawing Assistant
Aka: Draw like a 5-year-old but get great results!
Pipeline
- Inference: fits in < 4GB
- Resolution: 512x512px
- Speed: ~15 steps/second
Training
- Base Model: Stable Diffusion 2.1
- Training Requirements: < 14GB
- Setup: NVIDIA RTX 4070
The model is trained with the InstructPix2Pix pipeline, modified by the addition of a Multilayer Perceptron (FidelityMLP). The training loop processes an input_image, an edited_target_image, and a text_prompt carrying an embedded fidelity token f[0-9].
Input images are encoded into the latent space (VAE encoding), the prompt is processed by a CLIP text encoder, and the extracted fidelity value ($F \in [0.0, 0.9]$) is mapped to a corresponding fidelity embedding through the FidelityMLP.
The core diffusion process trains a U-Net to predict the noise ($\epsilon$) added to the VAE-encoded edited target latents. The U-Net is conditioned on both the fidelity-injected text embeddings (via cross-attention) and the VAE-encoded input image (doodle) latents.
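For illustration, here is a minimal sketch of how a scalar fidelity value could be mapped to an embedding and injected into the CLIP text embeddings. The class name FidelityMLP matches the description above, but the layer sizes, injection scheme, and helper names are assumptions rather than the repo's actual implementation.

```python
import torch
import torch.nn as nn

class FidelityMLP(nn.Module):
    """Maps a scalar fidelity value F in [0.0, 0.9] to an embedding vector.
    Layer sizes here are illustrative assumptions, not the repo's config."""
    def __init__(self, hidden_dim: int = 256, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, fidelity: torch.Tensor) -> torch.Tensor:
        # fidelity: (batch,) -> (batch, embed_dim)
        return self.net(fidelity.unsqueeze(-1))

def inject_fidelity(text_embeds: torch.Tensor, fidelity_embed: torch.Tensor) -> torch.Tensor:
    """Adds the fidelity embedding to every token of the CLIP text embeddings
    (one plausible injection scheme; the actual scheme may differ)."""
    return text_embeds + fidelity_embed.unsqueeze(1)

# Example: batch of 2 prompts with fidelity tokens f3 and f9
fidelity_mlp = FidelityMLP(embed_dim=1024)   # SD 2.1 text embeddings are 1024-dim
text_embeds = torch.randn(2, 77, 1024)       # stand-in for CLIP text encoder output
F = torch.tensor([0.3, 0.9])                 # parsed from the "f3" / "f9" prefixes
conditioned = inject_fidelity(text_embeds, fidelity_mlp(F))
print(conditioned.shape)  # torch.Size([2, 77, 1024])
```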
The optimization combines two loss terms:
- A reconstruction loss ($||\epsilon - \epsilon_\theta||^2$), minimizing the MSE between the sampled noise ($\epsilon$) and the U-Net's predicted noise ($\epsilon_\theta$).
- A fidelity-aware L1 loss, calculated on decoded images ($P_{i}$), which balances adherence to the original input ($O_{i}$) and the edited target ($E_{i}$) based on the normalized fidelity value $F$: $F \cdot L1(P_{i}, O_{i}) + (1 - F) \cdot L1(P_{i}, E_{i})$.
The total loss drives gradient updates via an AdamW optimizer, simultaneously training the U-Net and the FidelityMLP.
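A hedged sketch of the combined objective described above; the tensor names, the l1_weight balance factor, and the decoded-image inputs are illustrative assumptions, not the exact training code.

```python
import torch.nn.functional as F

def doodlepix_loss(noise, noise_pred, decoded_pred, input_image, target_image,
                   fidelity, l1_weight=1.0):
    """Two-term objective: noise-prediction MSE plus a fidelity-aware L1 on
    decoded images. `fidelity` is the normalized value F in [0.0, 0.9]."""
    # Reconstruction term: ||eps - eps_theta||^2
    rec_loss = F.mse_loss(noise_pred, noise)

    # Fidelity-aware term: F * L1(pred, original input) + (1 - F) * L1(pred, edited target)
    l1_to_input = (decoded_pred - input_image).abs().mean(dim=(1, 2, 3))
    l1_to_target = (decoded_pred - target_image).abs().mean(dim=(1, 2, 3))
    fid_loss = (fidelity * l1_to_input + (1.0 - fidelity) * l1_to_target).mean()

    return rec_loss + l1_weight * fid_loss

# Both the U-Net and the FidelityMLP receive gradients from this total loss, e.g.:
# optimizer = torch.optim.AdamW(
#     list(unet.parameters()) + list(fidelity_mlp.parameters()),
#     lr=1e-4,  # learning rate is an assumption
# )
```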
Dataset
- Data Size: ~4.5k images
- Image Generation: Dalle-3, Flux-Redux-DEV, SDXL
- Edge Extraction: Canny, Fake Scribble, Scribble XDoG, HED soft edge, Manual (a minimal Canny example follows this list)
- Doodles were hand-drawn and make up about 20% of the edge inputs
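For reference, a minimal sketch of the simplest extractor listed above (Canny via OpenCV); the threshold values are arbitrary examples, and the other extractors (HED, the scribble variants, XDoG) were run through the ComfyUI ControlNet Aux nodes mentioned in the Credits.

```python
import cv2

# Minimal Canny edge extraction; thresholds are example values,
# not the ones used to build the dataset.
image = cv2.imread("source.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(image, threshold1=100, threshold2=200)
# Invert if the training convention expects dark lines on a light background:
# edges = cv2.bitwise_not(edges)
cv2.imwrite("edges.png", edges)
```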
Fidelity Embedding in Action
The examples below vary the fidelity value from 0 to 9 while keeping the prompt, seed, and steps constant; a minimal inference sketch follows the first table.
| Prompt: f*, red heart, white background. | | | | |
|---|---|---|---|---|
| Image![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
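As an illustration of how the fidelity token is used at inference time, here is a hedged sketch built on the standard diffusers InstructPix2Pix pipeline; DoodlePix uses a custom pipeline implementation (see Credits), so the checkpoint path, pipeline class, and file names below are assumptions.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Hypothetical checkpoint path; the real DoodlePix pipeline may differ.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/DoodlePix", torch_dtype=torch.float16
).to("cuda")

doodle = Image.open("heart_doodle.png").convert("RGB").resize((512, 512))

# Same seed and step count; only the fidelity token changes (f0 = loose, f9 = strict).
for f in (0, 4, 9):
    generator = torch.Generator("cuda").manual_seed(42)
    result = pipe(
        prompt=f"f{f}, red heart, white background.",
        image=doodle,
        num_inference_steps=15,
        generator=generator,
    ).images[0]
    result.save(f"heart_f{f}.png")
```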
- The model also accepts Canny edges as input, and fidelity injection remains effective.
| Prompt: f*, woman, portrait, frame. black hair, pink, black background. | | | | |
|---|---|---|---|---|
| Image![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
More Examples
| Prompt: f*, potion, bottle, cork. blue, brown, black background. | Prompt: f*, maul, hammer. gray, brown, white background. | Prompt: f*, torch, flame. red, brown, black background. | | | |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| input![]() | F0![]() | F9![]() | input![]() | F0![]() | F9![]() |
LoRAs
LoRA training allows you to quickly bake a specific style into the model.
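As a rough sketch, a style LoRA could be applied on top of the pipeline with diffusers' standard LoRA loader; the paths, adapter name, and prompt below are placeholders, and the repo's own LoRA training/loading scripts may differ.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Placeholder paths and names; adjust to the actual checkpoint and LoRA files.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/DoodlePix", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/googh_lora.safetensors", adapter_name="googh")

doodle = Image.open("flower_doodle.png").convert("RGB").resize((512, 512))
styled = pipe(
    prompt="f5, flower, vase. yellow, green, blue background.",
    image=doodle,
    num_inference_steps=15,
).images[0]
styled.save("flower_googh.png")
```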
| input![]() | Googh![]() | DontStarve![]() |
|---|---|---|
| input![]() | Googh![]() | DontStarve![]() |
LoRA Examples
Googh
LoRAs retain the style prompts and fidelity injection from DoodlePix.
| input![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
|---|---|---|---|---|
| Low Fidelity![]() | High Fidelity![]() | | | |
| input![]() | Normal![]() | 3D![]() | Outline![]() | Flat![]() |
More Examples:
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
DontStarve
| Flower![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| Gift![]() | ![]() | ![]() | ![]() | ![]() |
| Carrot![]() | ![]() | ![]() | ![]() | ![]() |
| Rope![]() | ![]() | ![]() | ![]() | ![]() |
| Potato![]() | ![]() | ![]() | ![]() | ![]() |
| Heart![]() | ![]() | ![]() | ![]() | ![]() |
| Axe![]() | ![]() | ![]() | ![]() | ![]() |
| Potion![]() | ![]() | ![]() | ![]() | ![]() |
| Torch![]() | ![]() | ![]() | ![]() | ![]() |
The model shows great color understanding.
| Prompt: f9, flower, stylized. *color, green, white | | | |
|---|---|---|---|
| input![]() | red![]() | blue![]() | purple![]() |
| green![]() | cyan![]() | yellow![]() | orange![]() |
Limitations
- The model was trained mainly on objects and items: things rather than characters.
- It inherits most of the limitations of the Stable Diffusion 2.1 base model.
Reasoning
The objective is to train a model that can take drawings as input.
While most models and ControlNets were trained on Canny or similar line extractors (which focus on the most prominent lines in an image), drawings are made with intention. A few squiggly lines placed in the right spot can often convey a much better idea of what the image represents:
| Drawing![]() | Canny![]() |
|---|---|
Although the InstructPix2Pix pipeline supports an image guidance scale to control adherence to the input image, it tends to follow the drawing too strictly at higher values while losing compositional nuance at lower values.
TODOs
DATA
- [ ] Increase amount of hand-drawn line inputs
- [X] Smaller-Bigger subject variations
- [ ] Background Variations
- [ ] Increase Flat style references
- [ ] Improve color matches in prompts
- [ ] Clean up
Training
- [X] Release V1
- [ ] Release DoodleCharacters (DoodlePix but for characters)
- [X] Release Training code
- [X] Release Lora Training code
Credits
- This is a custom implementation of the training and pipeline scripts from the Diffusers repo
- The dataset was generated using chat-based DALL-E 3, SDXL, and FLUX-REDUX-DEV
- Edge extraction was made easy thanks to Fannovel16's ComfyUI Controlnet Aux
- ComfyUI was a big part of the Data Development process
- Around 30% of the images were captioned using Moondream2
- Dataset Handlers were built using PyQT
- Huge Thanks to the OpenSource community for hosting and sharing so much cool stuff