Latency improvements across all generation modes

Dev Patel · Staff Engineer · 6 min read

Speed is a feature. When a generation takes too long, momentum drains and people stop exploring. Over the last sprint we cut median generation time across images, video, code, and charts by roughly forty percent. None of it came from a single silver bullet — it was a stack of unglamorous wins. Here is what moved the needle.

Measure before you optimize

We started by instrumenting the full path, from the moment you hit enter to the first visible pixel. Breaking that timeline into stages — routing, model warm-up, generation, post-processing, delivery — showed us where the time actually went. The surprise, as usual, was that our intuition was wrong about which stage dominated. Most of the early latency was not in the model at all; it was in everything around it.
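A minimal sketch of per-stage timing along these lines. The `StageTimer` helper and its method names are hypothetical, chosen for illustration, and are not taken from the real instrumentation; only the stage names come from the list above.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: records wall-clock seconds per named stage of one request."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so tests can supply a deterministic one.
        self._clock = clock
        self.durations = {}  # stage name -> accumulated seconds

    @contextmanager
    def stage(self, name):
        """Time the body of a `with` block and add it to the named stage."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def dominant(self):
        """Return the name of the stage that took the most time."""
        return max(self.durations, key=self.durations.get)

    def shares(self):
        """Return each stage's fraction of the total recorded time."""
        total = sum(self.durations.values())
        if not total:
            return {}
        return {name: d / total for name, d in self.durations.items()}
```

Wrapping each step of a request in `with timer.stage("routing"):`, `with timer.stage("model warm-up"):` and so on, then calling `timer.shares()`, gives the kind of breakdown that can contradict intuition about which stage dominates.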

The changes that mattered

A handful of changes accounted for most of the improvement. Each was small on its own, but together they shifted the whole latency curve.

  • Warm pools — keeping models ready instead of cold-starting them per request.
  • Smarter routing — choosing the model faster and skipping a redundant planning hop.
  • Streaming everywhere — showing partial output the instant it exists.
  • Lighter post-processing — moving export work off the critical path.

Cold starts were the quiet killer

The single biggest win was warm pools. Spinning a model up from cold added seconds that had nothing to do with the actual work. By keeping a small set of warm instances ready and predicting demand, we turned most cold starts into warm hits. The cost is a bit of standing capacity; the payoff is that the median request never pays the startup tax.
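The idea can be sketched as a small pool that pre-loads instances and counts warm hits against cold starts. This is an illustration under assumptions, not the production design: `WarmPool`, `loader`, and `top_up` are hypothetical names, the demand prediction is left to the caller, and the sketch is single-threaded.

```python
from collections import deque


class WarmPool:
    """Hypothetical pool that keeps pre-loaded model instances ready for requests."""

    def __init__(self, loader, target_size):
        # `loader` is an assumed zero-argument callable that performs the
        # expensive cold start and returns a ready instance.
        self._loader = loader
        self._idle = deque(loader() for _ in range(target_size))
        self.warm_hits = 0
        self.cold_starts = 0

    @property
    def idle_count(self):
        return len(self._idle)

    def acquire(self):
        """Return a warm instance if one is idle, otherwise pay the cold start."""
        if self._idle:
            self.warm_hits += 1
            return self._idle.popleft()
        self.cold_starts += 1
        return self._loader()

    def release(self, instance):
        """Hand an instance back so a later request can reuse it warm."""
        self._idle.append(instance)

    def top_up(self, predicted_demand):
        """Pre-load instances until the idle count matches predicted demand."""
        while len(self._idle) < predicted_demand:
            self._idle.append(self._loader())
```

The trade-off described above shows up directly: a larger `target_size` or `predicted_demand` means more standing capacity, and in exchange fewer calls to `acquire` fall through to the cold path.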

Perceived speed is real speed

We also leaned hard into streaming. Even when total work is unchanged, showing the first chunk of an image or the first lines of code immediately makes the experience feel dramatically faster. People are patient when they can see progress and impatient when they cannot. Optimizing time-to-first-pixel turned out to matter as much as optimizing time-to-done.
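To make the two metrics concrete, here is a small sketch that streams a payload in chunks and measures time to first chunk separately from time to done. The function names are hypothetical, and the string payload stands in for whatever a real generator would produce.

```python
import time


def stream_in_chunks(payload, chunk_size):
    """Yield the payload piece by piece instead of returning it all at once."""
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a stream and return (time_to_first, time_to_done, chunk_count).

    time_to_first is None if the stream produced nothing.
    """
    start = clock()
    time_to_first = None
    count = 0
    for _ in chunks:
        if time_to_first is None:
            # First visible output: this is what the user perceives as "started".
            time_to_first = clock() - start
        count += 1
    time_to_done = clock() - start
    return time_to_first, time_to_done, count
```

Tracking both numbers side by side makes it visible when a change improves perceived speed (a lower time to first chunk) even though total work, and so time to done, is unchanged.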

Guardrails so it stays fast

Speed regresses quietly if you let it. We added latency budgets to our continuous tests, so a change that slows a generation path fails the build the same way a broken test would. Performance is now something we defend on every commit, not something we rediscover once a quarter. The result is a workspace that feels quick today — and a process designed to keep it that way.
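A latency budget check of this kind might look like the sketch below. It is an assumption-laden illustration rather than the actual test harness: `check_budgets` is a hypothetical name, the budgets and samples are supplied by the caller, and using the median as the gate is an assumption based on the median figure quoted earlier.

```python
import statistics


def check_budgets(samples_ms, budgets_ms):
    """Compare median latency per generation path against its budget.

    samples_ms: dict mapping path name -> list of measured latencies in ms.
    budgets_ms: dict mapping path name -> allowed median latency in ms.
    Returns a list of violation messages; an empty list means the check passes.
    """
    violations = []
    for path, budget in budgets_ms.items():
        runs = samples_ms.get(path)
        if not runs:
            # A path with a budget but no measurements is treated as a failure,
            # so a silently skipped benchmark cannot pass the build.
            violations.append(f"{path}: no samples recorded")
            continue
        median = statistics.median(runs)
        if median > budget:
            violations.append(
                f"{path}: median {median:.0f} ms exceeds budget {budget} ms"
            )
    return violations
```

In a continuous test, asserting that `check_budgets(...)` returns an empty list would make a slow generation path fail the build in the same way as any other failing assertion.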
