Building an Image Editing Pipeline

December 31, 2025

Building an Image Editing Pipeline on Limited Hardware (Shared H200s)

In the past week, the researchers at Qwen cooked up two amazing open-source/open-weight models:

  • Qwen-image-layered (a model that decomposes an image into what it thinks are the layers that compose it)
  • Qwen-image-edit (a model built on top of Qwen-image that is supposedly better at preserving the existing elements in an image for both high- and low-level changes described with a prompt)

One obvious use case came to mind: an image editing pipeline that lets an artist edit the layers of a generated image, either by downloading the layers and importing them into a tool like GIMP or Photoshop, or by selecting a layer and editing it with Qwen-image-edit. I decided to build it on Hugging Face's shared H200s (ZeroGPU).

The models aren't perfect, and I'm leaning towards the idea that translating the text modality into the image modality will always be somewhat ambiguous. (This is true for humans as well: give 10 different artists a description of what you want, even with as much context as possible, and you will get 10 different art pieces.) That being said, I went for it since...why not?

Here it is if you want to try it out. Do note that you might need a Hugging Face Pro account given the amount of GPU time required to run the models: roughly 180 seconds per inference, which is beyond the free or anonymous tiers.

Qwen-image-layered-to-image-edit-pipeline

The Engineering Challenge: Resource Management

This project wasn't just about hooking up APIs. The real challenge was Orchestration vs. Memory. I naively started by pre-loading both models to minimize latency. This immediately caused OOM (Out of Memory) crashes because the combined weights of the freshly minted Qwen-layered and Qwen-edit without quantization exceed the standard VRAM limits on shared H200 instances.

The Solution: Aggressive Swapping

To make this run on Hugging Face ZeroGPU (shared H200s), I implemented a strict mutual exclusion pattern in the backend:

  1. Segmentation Phase: Load the Layered model into GPU → Run Inference → Unload completely to CPU/Disk.
  2. Editing Phase: Load the Edit model into GPU → Run Inference.
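
Here is a minimal sketch of what that swap can look like, assuming Diffusers-style pipelines on a ZeroGPU Space. The repo IDs, call signatures, and helper names are illustrative assumptions, not the actual code in the Space.

```python
# Minimal sketch of the load -> infer -> unload swap (IDs and signatures illustrative).
import gc
import spaces
import torch
from diffusers import DiffusionPipeline

def _release(pipe):
    # Move the pipeline's weights off the GPU and reclaim the cached VRAM.
    pipe.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()

@spaces.GPU  # ZeroGPU only attaches the H200 while a decorated function runs
def decompose(image):
    # Segmentation phase: load the layered model, run it, then unload completely.
    layered = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image-Layered", torch_dtype=torch.bfloat16  # placeholder repo ID
    ).to("cuda")
    layers = layered(image=image).images
    _release(layered)
    return layers

@spaces.GPU
def edit(layer, prompt):
    # Editing phase: the edit model only goes onto the GPU after the layered model is gone.
    editor = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
    ).to("cuda")
    edited = editor(image=layer, prompt=prompt).images[0]
    _release(editor)
    return edited
```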

The tradeoff is obvious: higher latency (loading time) in exchange for the ability to run state-of-the-art pipelines on basically free hardware. (Hugging Face Pro is about $9 per month, which gives roughly 25 minutes of H200 usage per day, among many other benefits.)

That being said, there are now quantized versions provided by Unsloth, which I will play around with in the near future.

Code Architecture

  • app.py (Entry Point): Handles initialization, authentication (Hugging Face Hub), and launches the UI. It acts as the orchestrator.
  • ui.py (Presentation Layer): Explicit separation of UI logic. It defines the Gradio Blocks, manages state, and wires inputs to inference functions.
  • utils.py (Shared Utilities): Generic helper functions.
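
As a rough illustration of how the presentation layer wires inputs to inference, here is a minimal Gradio Blocks sketch. The component layout and the imported functions (decompose_image, edit_layer) are hypothetical, not the actual API of the Space.

```python
# Sketch of ui.py-style wiring: Gradio Blocks components hooked to inference functions.
import gradio as gr
from inference import decompose_image, edit_layer  # hypothetical inference-layer functions

with gr.Blocks() as demo:
    source = gr.Image(label="Source image", type="pil")
    decompose_btn = gr.Button("Decompose into layers")
    layers = gr.Gallery(label="Decomposed layers")

    selected = gr.Image(label="Selected layer", type="pil")
    prompt = gr.Textbox(label="Edit prompt")
    edit_btn = gr.Button("Edit selected layer")
    edited = gr.Image(label="Edited layer")

    # Wire UI events to the inference layer; component values carry the state.
    decompose_btn.click(decompose_image, inputs=source, outputs=layers)
    edit_btn.click(edit_layer, inputs=[selected, prompt], outputs=edited)

demo.launch()
```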

Critical pieces:

  • inference.py (Business Logic): encapsulates the actual execution of the models. It handles data preprocessing (image conversion), seed management, and interfaces with the GPU backend.
  • models.py (Data Access/Infrastructure): The "singleton-like" manager for model loading. This is where I had to write very specific code to get around the resource constraints of a Hugging Face shared H200 (aka their "ZeroGPU" offering). It enforces a strict mutual exclusion pattern to ensure that only one model is loaded at a time.
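
For a sense of what that mutual exclusion looks like, here is a sketch of a singleton-like manager that keeps at most one pipeline on the GPU at a time. The class name, method, and loader details are assumptions for illustration, not the real models.py.

```python
# Singleton-like manager: at most one pipeline resident on the GPU at any time.
import gc
import torch
from diffusers import DiffusionPipeline

class ModelManager:
    _active_name = None
    _active_pipe = None

    @classmethod
    def get(cls, name: str, repo_id: str):
        # Reuse the pipeline if it's already the active one.
        if cls._active_name == name:
            return cls._active_pipe
        # Otherwise evict whatever is currently occupying the GPU.
        if cls._active_pipe is not None:
            cls._active_pipe.to("cpu")
            cls._active_pipe = None
            gc.collect()
            torch.cuda.empty_cache()
        # Load the requested model and mark it as the active one.
        cls._active_pipe = DiffusionPipeline.from_pretrained(
            repo_id, torch_dtype=torch.bfloat16
        ).to("cuda")
        cls._active_name = name
        return cls._active_pipe
```

With something like this, the inference layer can simply ask the manager for whichever pipeline it needs (e.g. ModelManager.get("edit", "Qwen/Qwen-Image-Edit")) and the eviction of the other model happens behind the scenes.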

While building this, I also wrote a small utility to keep my Hugging Face and GitHub repos in sync: Huggit. I plan on adding more features, like Git LFS management between the two repos. So stay tuned! If you want to try it out now, just run npm i -g huggit to install it and huggit init to initialize it in your repo.