I’m one of the builders. Once AI requests moved beyond simple sync calls, we kept running into the same problems in production: retries hiding failures, async flows that were hard to reason about, frontend state drifting, and providers timing out mid-request.
This page breaks down the three request patterns we see teams actually using in production (sync, async, and event-driven async), how data flows in each case, and why we ended up favoring an event-driven approach for interactive, streaming apps.
Happy to answer questions or go deeper on any part of the architecture.
If a team adopts this pattern and later decides to remove ModelRiver, how hard is it to unwind? Are the request and event models close to provider APIs or fairly opinionated?
This was something we were careful about. The request and event models are intentionally close to what most providers already expose, rather than introducing a completely new abstraction.
Teams usually integrate it incrementally in front of existing calls. If you remove it, you’re mostly deleting the orchestration layer and keeping your provider integrations and client logic. You lose centralized retries and observability, but you’re not stuck rewriting your entire request model.
If adopting it requires a full rewrite, that’s usually a sign it’s being applied too broadly.
How do you reason about retries and correctness once a stream has already started? For example, how do you avoid duplicated or missing tokens if a provider fails mid-stream?
This is one of the harder problems, and there isn’t a perfect answer.
The main thing we try to avoid is treating mid-stream retries as if they were pre-request retries. Once a stream has started, we treat it as a sequence of events with checkpoints rather than a single opaque response. Retries are scoped to known safe boundaries, and anything ambiguous is surfaced explicitly instead of silently re-emitting tokens.
In other words, we prioritize correctness over the appearance of a seamless stream. If we can’t guarantee no duplication, we make that visible rather than hide it.
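To make that concrete, here is a rough sketch of the idea in Python. It is not our actual implementation: `StreamEvent`, `StreamState`, the `seq` numbering, and the `checkpoint` flag are all hypothetical names, and it assumes every event carries a monotonically increasing sequence number. Replayed events are dropped, and a gap is recorded as a warning instead of being papered over.

```python
from dataclasses import dataclass, field


@dataclass
class StreamEvent:
    seq: int                  # position of this event within the stream (assumed)
    token: str                # partial output carried by the event
    checkpoint: bool = False  # True if this event is a safe retry boundary


@dataclass
class StreamState:
    tokens: list = field(default_factory=list)
    last_seq: int = -1
    last_checkpoint: int = -1
    warnings: list = field(default_factory=list)

    def apply(self, ev: StreamEvent) -> None:
        # An event we have already seen (e.g. replayed after a retry) is dropped.
        if ev.seq <= self.last_seq:
            return
        # A gap means tokens may be missing: record it instead of hiding it.
        if ev.seq != self.last_seq + 1:
            self.warnings.append(
                f"gap: expected seq {self.last_seq + 1}, got {ev.seq}"
            )
        self.tokens.append(ev.token)
        self.last_seq = ev.seq
        if ev.checkpoint:
            self.last_checkpoint = ev.seq

    def resume_from(self) -> int:
        # A retry would ask for events starting just after the last safe boundary.
        return self.last_checkpoint + 1
```

In this sketch a retry resumes from `resume_from()`, so some events arrive twice and are ignored by `apply`. Whether a provider can actually replay from a boundary is a separate question that this example does not answer.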
In practice, an event-driven approach starts to feel like overkill when requests are short-lived and failures are cheap. If a call is fast, idempotent, and the user isn’t waiting on partial output, a simple sync request is usually the clearest solution.
Queue-based async still works well for batch jobs, offline processing, or anything where latency and ordering aren’t user-visible. The event-driven approach mainly pays off once you have long-lived or interactive requests where failures can happen mid-response and you care about what the user actually sees.
That makes sense. How do you decide early on which requests are likely to “grow into” needing an event-driven approach, versus staying simple sync or queue-based long term?
In our experience, it usually comes down to whether the request has user-visible state over time. If the response is something you can treat as atomic, so that it either succeeds or fails cleanly, it tends to stay simple.
The requests that “grow” tend to share a few signals early on: they stream partial results, they take long enough that the frontend needs progress updates, or failures start happening after something has already been shown to the user. Another common signal is when retries stop being transparent and you start needing to explain to users what actually happened.
Once those patterns show up, teams usually end up reworking the flow anyway. The event-driven approach just makes that lifecycle explicit earlier, instead of letting it emerge implicitly and painfully over time.
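If it helps, the signals above can be written down as a small checklist. This is only a sketch of the heuristic described in this thread, not a rule we ship: `RequestProfile`, its field names, and `suggest_pattern` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    # Each flag mirrors one of the signals mentioned above (names are hypothetical).
    streams_partial_output: bool = False
    needs_progress_updates: bool = False
    can_fail_after_user_saw_output: bool = False
    retries_need_explaining: bool = False
    user_is_waiting: bool = True


def suggest_pattern(p: RequestProfile) -> str:
    """Return a rough suggestion: 'sync', 'queue-based async', or 'event-driven'."""
    has_visible_state_over_time = any([
        p.streams_partial_output,
        p.needs_progress_updates,
        p.can_fail_after_user_saw_output,
        p.retries_need_explaining,
    ])
    if has_visible_state_over_time:
        return "event-driven"
    if not p.user_is_waiting:
        # Batch or offline work where latency and ordering are not user-visible.
        return "queue-based async"
    # Fast, atomic call that succeeds or fails cleanly.
    return "sync"
```

For example, `suggest_pattern(RequestProfile(streams_partial_output=True))` returns `"event-driven"`, while a default `RequestProfile()` returns `"sync"`.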
I’m another founder on this. One thing that surprised us while building AI features was how often the hard problems weren’t about model choice, but about request lifecycle. Once you introduce streaming, retries, and multiple providers, a lot of implicit assumptions in typical request–response code stop holding.
We kept seeing teams reinvent similar patterns in slightly different ways, especially around correlating events, handling partial failures, and keeping the frontend in sync with what actually happened on the backend. The goal with this writeup was to make those tradeoffs explicit and show what’s actually happening on the wire in each approach.
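As a rough illustration of what correlating events means here, the sketch below folds a mixed event log into one record per request. The event shape (`request_id`, `type`, `text`, `error`) and the type names `chunk`, `completed`, and `failed` are assumptions for the example, not a description of any particular provider's wire format.

```python
from collections import defaultdict


def reduce_events(events):
    """Fold an interleaved event log into one status record per request_id."""
    requests = defaultdict(
        lambda: {"status": "pending", "text": "", "error": None}
    )
    for ev in events:
        req = requests[ev["request_id"]]
        # Ignore late events for a request that has already finished.
        if req["status"] in ("completed", "failed"):
            continue
        kind = ev["type"]
        if kind == "chunk":
            req["status"] = "streaming"
            req["text"] += ev["text"]
        elif kind == "completed":
            req["status"] = "completed"
        elif kind == "failed":
            # Keep the partial text so the UI can show what the user already saw.
            req["status"] = "failed"
            req["error"] = ev["error"]
    return dict(requests)
```

A frontend that renders from this kind of record shows a partial failure as "failed, with this much text already displayed" rather than losing the partial output.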
Curious to hear how others here are handling long-lived or streaming AI requests in production, especially once things start failing in non-obvious ways.