Benchmarking Qwen 27B and ComfyUI on One RTX 5090


I recently wanted to keep qwen3.5:27b loaded on a single RTX 5090 while still using ComfyUI for image generation.

That sounds simple until you remember what is really happening: the language model and the image pipeline are competing for the same pool of VRAM, all the time.

So I ran a benchmark to answer one practical question:

What is the best setup if I want both decent image speed and enough VRAM headroom to avoid constant trouble?

Test goal

The goal was not to find the single fastest benchmark score.

The goal was to find the best shared GPU setup while Qwen stayed resident in memory.

In other words: I wanted the fastest image generation that still left safe headroom with Qwen loaded.

Test setup

The workload was fixed across every run: the same 3-image generation job, on the same RTX 5090, with Qwen resident in memory the whole time.

I compared 8 combinations across three ideas, all visible in the setup names below: ComfyUI's VRAM mode (normal vs. lowvram), model caching (default vs. cacheNone), and reserving VRAM for other processes (none vs. reserve4).

You do not need to care about every implementation detail to understand the result. The main thing is the trade-off each combination makes between generation speed and free VRAM.
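
For concreteness, here is how those three axes likely map onto ComfyUI launch flags. This is a sketch, not my exact invocation: the flag names (`--lowvram`, `--cache-none`, `--reserve-vram`) come from ComfyUI's CLI, and the 4GB reserve amount is read off the reserve4 setup names.

```shell
# One launch line per axis combination (8 total); four representative ones:
python main.py                                          # normal_default
python main.py --reserve-vram 4                         # normal_reserve4: fastest, least headroom
python main.py --lowvram                                # lowvram_default: the profile I kept
python main.py --lowvram --cache-none --reserve-vram 4  # lowvram_cacheNone_reserve4: most conservative
```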

The main result

Here is the simplified version of the benchmark:

| Setup | 3-image time | Lowest free VRAM | Takeaway |
|---|---|---|---|
| normal_reserve4 | 9.972s | 233MB | Fastest, but too close to the edge |
| normal_default | 11.467s | 212MB | Also fast, also risky |
| lowvram_default | 16.615s | 7913MB | Best balance |
| lowvram_reserve4 | 19.105s | 7913MB | Safer, but slower than needed |
| normal_cacheNone_reserve4 | 19.748s | 7433MB | Safe, but throughput drops |
| normal_cacheNone | 20.475s | 7403MB | Safe, but throughput drops |
| lowvram_cacheNone_reserve4 | 35.328s | 10185MB | Very conservative, too slow |
| lowvram_cacheNone | 38.991s | 10155MB | Very conservative, too slow |
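
The "Lowest free VRAM" column is the kind of number you get by polling free memory while a run is in flight and keeping the minimum. A minimal sketch of such a sampler, assuming `nvidia-smi` is on the PATH; the helper names are illustrative, not the exact script I used:

```python
import subprocess
import time

def parse_free_mib(output: str) -> int:
    # nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    # prints a bare MiB count per GPU, e.g. "7913".
    return int(output.strip().splitlines()[0])

def min_free_during(duration_s: float, interval_s: float = 0.5) -> int:
    # Poll free VRAM for duration_s seconds and return the lowest sample.
    lowest = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        free = parse_free_mib(out)
        lowest = free if lowest is None else min(lowest, free)
        time.sleep(interval_s)
    return lowest
```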

The fastest setup was clearly normal_reserve4.

But that profile only left 233MB of free VRAM at its lowest point.

That is not really operational headroom. That is a benchmark run gambling that nothing slightly heavier happens next.

lowvram_default was slower, but it was still reasonably fast and left almost 8GB of free VRAM.

That made it the clear winner for real usage.

Why I chose lowvram_default

This was the decision logic:

normal_reserve4 won on speed, but it behaved like a setup that would eventually bite me.

lowvram_default lost a few seconds across 3 images, but gave me a much healthier margin.

That mattered more because I was not building a one-off benchmark demo. I was trying to build a machine I could keep using without constantly watching memory graphs.

So my final ComfyUI choice was lowvram_default: the low-VRAM profile with default caching and no extra VRAM reservation.

The second test: Qwen KV cache format at 28k context

I also compared Qwen’s KV cache modes because the LLM side matters just as much in a shared setup.
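
Switching between those modes is an environment-variable change in Ollama (assuming a recent Ollama build; quantized KV cache also requires flash attention to be enabled):

```shell
# KV cache quantization in Ollama is opt-in via environment variables.
# Quantized modes (q8_0 / q4_0) require flash attention.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # or q4_0 / f16
ollama serve
```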

The measured dedicated VRAM usage looked like this:

| KV cache mode | VRAM used | Difference vs q4_0 | Eval rate |
|---|---|---|---|
| q4_0 | 23075.8MB | baseline | 62.17 tok/s |
| q8_0 | 23523.8MB | +448MB | 62.94 tok/s |
| f16 | 24243.8MB | +1168MB | 59.44 tok/s |

This was useful for one reason:

q8_0 was not dramatically more expensive than q4_0.

At a 28k context, the extra cost was only about 0.45GB.

That made q8_0 feel like a practical default for this kind of setup. It keeps the memory increase modest while avoiding the larger jump of f16.
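
The per-token cost is easy to back out from those measurements. A quick check, assuming the full 28k (28 × 1024 token) context was allocated:

```python
# Measured dedicated VRAM at 28k context, in MiB (from the table above).
usage = {"q4_0": 23075.8, "q8_0": 23523.8, "f16": 24243.8}
context_tokens = 28 * 1024

for mode, mib in usage.items():
    delta_mib = mib - usage["q4_0"]
    per_token_kib = delta_mib * 1024 / context_tokens
    print(f"{mode}: +{delta_mib:.0f} MiB vs q4_0 (~{per_token_kib:.0f} KiB/token)")
```

At roughly 16 KiB of extra KV cache per token, q8_0 stays cheap; f16's roughly 42 KiB/token is where the budget starts to hurt.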

The final operating rules I kept

After the benchmark, I settled on a simple operating policy: run ComfyUI in the lowvram_default profile, keep Qwen's KV cache at q8_0, and unload the language model before anything heavier than the benchmarked workload.

That last rule matters because it turns “maybe the system survives” into a clearer control point.

I would rather accept a cold start on the language model than keep pretending every workload should run concurrently.
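
With Ollama, that cold-start trade is one API call: a request with keep_alive set to 0 evicts the model immediately, freeing its VRAM for the image job. A sketch, assuming Ollama's default local port:

```shell
# Unload the resident model before a heavy image job; the next prompt
# will cold-start it. "keep_alive": 0 tells Ollama to evict immediately.
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3.5:27b", "keep_alive": 0}'
```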

Final takeaway

The most useful lesson from this benchmark was simple:

On a shared RTX 5090 workflow, the best setup is usually not the fastest one.

The best setup is the one that leaves enough headroom to stay stable when the workload stops being perfectly clean.

For me, that meant accepting 16.615s instead of 9.972s for 3 images, because the safer profile left almost 8GB free instead of almost nothing.
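
Spelled out with the table's numbers, the trade looks like this:

```python
fast_s, safe_s = 9.972, 16.615     # 3-image times: normal_reserve4 vs lowvram_default
fast_free, safe_free = 233, 7913   # lowest free VRAM in MB

extra_seconds = safe_s - fast_s                     # ~6.6s slower per 3 images
extra_headroom_gb = (safe_free - fast_free) / 1024  # 7.5 GB more headroom
print(f"{extra_seconds:.1f}s slower, {extra_headroom_gb:.1f} GB safer")
```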

That is a good trade if the goal is to keep both text and image workflows on one machine without constant babysitting.
