I wanted one machine to do two jobs at the same time:
- keep a large language model loaded for chat and tools
- still generate images without the GPU constantly running into memory trouble
The machine was an RTX 5090.
The text model was qwen3.5:27b.
The image side was ComfyUI running z_image_turbo_nvfp4.safetensors.
The simple question was:
Can I keep Qwen resident in VRAM and still use ComfyUI without turning the whole setup into a crash-prone science project?
After benchmarking a few combinations, the answer was yes, but only if I stopped chasing the absolute fastest result.
The setup I kept
This is the configuration I ended up keeping:
- Qwen: qwen3.5:27b
- Ollama: FLASH_ATTENTION=1
- Ollama KV cache: q8_0
- Ollama context length: 28672
- ComfyUI: --lowvram
- Workflow UNet: z_image_turbo_nvfp4.safetensors
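For reference, the whole stack can be brought up roughly like this. This is a sketch, not a service file: it assumes Ollama's `OLLAMA_FLASH_ATTENTION`, `OLLAMA_KV_CACHE_TYPE`, and `OLLAMA_CONTEXT_LENGTH` environment variables and a stock ComfyUI checkout; paths and process management will differ on your machine.

```shell
# Ollama server: flash attention on, q8_0 KV cache, 28k context
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_CONTEXT_LENGTH=28672 \
ollama serve &

# Pull the text model into VRAM so it stays resident for chat and tools
ollama run qwen3.5:27b ""

# ComfyUI in its safer memory mode (run from the ComfyUI directory)
python main.py --lowvram
```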
I also kept two operating rules:
- heavy GPU jobs should run one at a time
- if free VRAM drops below 4GB, stop Qwen before starting a heavy image run
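The second rule is easy to automate. A minimal sketch, assuming `nvidia-smi` is on PATH and using `ollama stop` to unload the model; the 4096MiB threshold is this post's convention, not anything Ollama or ComfyUI enforce.

```python
import subprocess

THRESHOLD_MIB = 4096  # below this much free VRAM, unload Qwen first


def free_vram_mib() -> int:
    """Query free VRAM on GPU 0 via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0].strip())


def should_stop_llm(free_mib: int, threshold_mib: int = THRESHOLD_MIB) -> bool:
    """The fallback rule: under the threshold, the LLM gives way."""
    return free_mib < threshold_mib


def prepare_for_heavy_image_run() -> None:
    """Call this right before kicking off a heavy ComfyUI job."""
    if should_stop_llm(free_vram_mib()):
        # "ollama stop" unloads the model and frees its VRAM
        subprocess.run(["ollama", "stop", "qwen3.5:27b"], check=True)
```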
That was the best balance between speed and stability.
Why the fastest setup was not the right setup
The fastest ComfyUI profile I tested finished 3 images in 9.972s.
That sounds great until you look at the memory headroom:
- lowest free VRAM: 233MB
That is effectively no safety margin.
It might survive a clean benchmark run, but it leaves almost no room for background overhead, slightly heavier prompts, or a different workload later.
The setup I actually kept was slower:
- 3 images in 16.615s
- lowest free VRAM: 7913MB
That tradeoff was easy to accept.
I gave up a bit of raw speed, but in return I got a much larger buffer and a setup I could trust for normal use.
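Put in per-image terms, the trade looks even more lopsided. This just rearranges the two benchmark runs above:

```python
# The two ComfyUI profiles measured above, 3 images each
fast = {"total_s": 9.972, "min_free_mib": 233}
kept = {"total_s": 16.615, "min_free_mib": 7913}
images = 3

fast_per_image = fast["total_s"] / images  # ~3.32 s/image
kept_per_image = kept["total_s"] / images  # ~5.54 s/image

slowdown = kept["total_s"] / fast["total_s"]                 # ~1.67x slower
headroom_gain = kept["min_free_mib"] / fast["min_free_mib"]  # ~34x the headroom

print(f"{fast_per_image:.2f}s vs {kept_per_image:.2f}s per image")
print(f"{slowdown:.2f}x slower, {headroom_gain:.0f}x the free VRAM")
```

Roughly 67% more time per image bought about 34 times the worst-case headroom.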
The small surprise on the Qwen side
I also tested different KV cache formats for Qwen at a 28k context length.
These were the results:
- q4_0: 23075.8MB
- q8_0: 23523.8MB
- f16: 24243.8MB
The useful part was this:
q8_0 only used about 448MB more VRAM than q4_0.
That was a smaller penalty than I expected, and it made q8_0 feel like the sensible default for a large context setup.
f16 used about 1.17GB more than q4_0, which felt harder to justify in a shared-GPU workflow.
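The deltas fall straight out of the three measurements above:

```python
# Measured Qwen VRAM usage at 28k context, per KV cache format
usage_mib = {"q4_0": 23075.8, "q8_0": 23523.8, "f16": 24243.8}

for fmt in ("q8_0", "f16"):
    delta = usage_mib[fmt] - usage_mib["q4_0"]
    print(f"{fmt}: +{delta:.1f}MB over q4_0 (+{delta / 1000:.2f}GB)")
```

q8_0 costs 448MB over q4_0; f16 costs 1168MB, about 1.17GB.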
My practical takeaway
If you want one RTX 5090 to handle both a large local LLM and image generation, the winning strategy is not “push the card until it almost explodes”.
The better strategy is:
- keep the language model efficient but not too aggressive
- let ComfyUI use a safer memory mode
- leave real VRAM headroom
- serialize heavy work instead of pretending everything should run at once
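The serialization rule can be as simple as a file lock that every heavy job must hold before touching the GPU. A POSIX-only sketch; the lock path is an arbitrary convention for this example, not something Ollama or ComfyUI know about:

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical shared path; any file all your GPU jobs can reach works
LOCK_PATH = "/tmp/gpu-heavy-job.lock"


@contextmanager
def gpu_job_lock():
    """Block until no other heavy GPU job holds the lock."""
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # waits for the other job to finish
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


# Usage: wrap each heavy run so jobs queue instead of fighting over VRAM
# with gpu_job_lock():
#     run_comfyui_batch()
```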
For me, that meant:
- Qwen with q8_0 and 28k context
- ComfyUI with --lowvram
- a hard fallback rule below 4GB free VRAM
That setup was not the benchmark winner on paper.
It was the setup that actually made sense to live with.
If you want the full benchmark numbers, I also published the detailed write-up: Benchmarking Qwen 27B and ComfyUI on One RTX 5090.