Show HN: Rayline routes Claude Code subagents to on-device and cheaper models

Show HN: Rayline routes Claude Code subagents to on-device and cheaper models (rayline.ai)

11 points by davidvgilmore 50 days ago

Hi HN,

I’m one of the builders of Rayline.

Rayline is a Claude Code-compatible LLM gateway. It intercepts and overrides Claude Code’s internal routing and lets you route subagent calls to different models instead. For example, you can run the main agent on Opus, some subagents on cloud-hosted open models, and other subagents on-device.

We’ve seen others implement routing for Claude Code as tools the agent can invoke. In our experience, that doesn’t work well because it requires the main agent to spend tokens thinking about and calling the tools, and LLMs are generally a very inefficient way to make routing decisions. By implementing Rayline as a gateway, we let you configure routing decisions deterministically, and you can optionally use our ML model to make them instead.
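To make "deterministic routing at the gateway" concrete, here is a minimal sketch of a rule table that maps a subagent type to a model. This is not Rayline's actual configuration format; the `Route` type, the subagent labels, and the model names are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str     # hypothetical model label, not a real model ID
    location: str  # "cloud" or "on-device"


# Hypothetical rule table: subagent type -> where its calls should go.
RULES = {
    "main": Route("opus", "cloud"),
    "repo-search": Route("open-model-a", "cloud"),
    "ci-poll": Route("small-local-model", "on-device"),
}

# Anything without an explicit rule falls back to the main model.
DEFAULT = Route("opus", "cloud")


def pick_route(subagent_type: str) -> Route:
    """Look up the route for a subagent; no LLM call is involved."""
    return RULES.get(subagent_type, DEFAULT)
```

The point of the sketch is that the decision is a plain lookup made outside the agent, so the main agent spends no tokens on it and the same input always yields the same route.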

We built it after noticing that Claude Code sessions contain a lot of subagent calls that don’t all need the same model. Other routers exist, but we built Rayline to let us continue using Claude Code (no separate harness), route tasks at a subagent level, and route across cloud and on-device. The main agent often benefits from Opus. But many delegated calls have narrow scope: search the repo, summarize context, inspect an error, poll for CI updates, etc.

The thing we’re exploring is subagent-level routing. The main cost lever in coding agents is usually cached vs non-cached input. Subagent delegations are a natural point to make routing decisions because a delegated call starts with its own context, so sending it to a different model doesn’t bust the main agent’s cache. We look at the message-thread context for a delegated call and choose a model for that call. At a task level, Sonnet and Haiku are almost always less capability-per-dollar than open models, so the main advantage is better and (much) cheaper subagents (60-90% in our private beta).
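A rough sketch of the caching argument, with loudly made-up numbers: the per-million-token prices, the 95% cache-hit share, and the 0.1 cached-token multiplier below are assumptions for illustration, not any provider's real pricing. It compares staying on the main model with a warm cache, switching the whole main thread to a cheaper model (which starts with no cache), and delegating a small subagent call to the cheaper model.

```python
def input_cost(tokens, price_per_mtok, cached_fraction=0.0, cached_multiplier=0.1):
    """Dollar cost of one request's input tokens.

    cached_fraction: share of tokens served from the prompt cache (assumed).
    cached_multiplier: price of a cached token relative to a fresh one (assumed).
    """
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh + cached * cached_multiplier) * price_per_mtok / 1_000_000


MAIN_THREAD_TOKENS = 200_000  # long-running main agent context (illustrative)
SUBAGENT_TOKENS = 8_000       # narrow delegated task with fresh context (illustrative)

# Expensive model, but most of the prefix is already cached.
stay_on_main = input_cost(MAIN_THREAD_TOKENS, 10.0, cached_fraction=0.95)

# Cheaper model, but it has never seen the prefix, so nothing is cached.
switch_main_thread = input_cost(MAIN_THREAD_TOKENS, 2.0)

# Cheaper model on a small delegated call: no cache to lose.
delegate_subagent = input_cost(SUBAGENT_TOKENS, 2.0)
```

With these invented numbers, switching the whole thread to the model that is five times cheaper per token costs more for that request (about $0.40) than staying put with a warm cache (about $0.29), while the delegated call costs about $0.016. Real ratios depend entirely on actual prices and cache behavior.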

The whole world seems to have started talking about model routing in the past two weeks, so apparently others agree it’s a relevant product area.

We’d love to get feedback from the HN community!

3 comments

Hans_Cui 48 days ago

The hard part with these routers is deciding the cheap model is "good enough" without already knowing the answer. RouteLLM trained classifiers for it; Not Diamond sells it as a service. Curious how you decide a subagent task is safe to send on-device?

camomileandmilk 50 days ago

Can you elaborate on this "Sonnet and Haiku are almost always less capability-per-dollar than open models"?

davidvgilmore 50 days ago

Yes - in short, open models like DeepSeek, Mimo, Kimi, and GLM tend to complete tasks with fewer tokens and cost less per token than both Sonnet and Haiku. So those models are more cost-efficient, and we often think of that as them having higher "capability-per-dollar" than Sonnet or Haiku.

Much of Claude Code's internal model routing ends up delegating tasks to Sonnet or Haiku, so by intercepting those calls and using open models instead, we often see better performance at a better price.
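As a back-of-the-envelope illustration of the "fewer tokens, lower price per token" point: cost per completed task is tokens used times price per token, so the two effects multiply. The token counts and prices below are invented for the example and do not describe any specific model.

```python
def cost_per_task(tokens_used: int, price_per_mtok: float) -> float:
    """Dollar cost of finishing one task: tokens consumed times unit price."""
    return tokens_used * price_per_mtok / 1_000_000


# Hypothetical numbers for the same task on two models.
model_a = cost_per_task(tokens_used=50_000, price_per_mtok=3.0)  # pricier, more tokens
model_b = cost_per_task(tokens_used=40_000, price_per_mtok=0.5)  # cheaper, fewer tokens

# How many times more expensive model A is for this task.
cost_ratio = model_a / model_b
```

This only captures the "per dollar" half; whether both models actually complete the task equally well is the "capability" half and has to be measured separately.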

camomileandmilk 50 days ago

Yeah, I get you now. But those are all Chinese-hosted, right? I don't think my company will allow us to use them.

davidvgilmore 50 days ago

Many of them are produced by Chinese labs. Some, like Nemotron, are U.S.-made. And we support inference providers in both the U.S. and overseas.

If geography is important, we can restrict the regions where inference takes place. And if you don't want to use Chinese-trained models, you can use others like Mistral, Nemotron, Google's, or OpenAI's.

oypass 50 days ago

How is this different from OpenRouter?

davidvgilmore 50 days ago

Four ways: (1) We are built specifically for Claude Code model routing. (2) We route at a subagent/subtask level. (3) We support on-device routing. (4) We have a built-in ML router trained specifically to route Claude Code subagent tasks. Its use is optional.

oypass 50 days ago

What are the benefits of on-device routing? How do you decide if a task can be run on-device?

davidvgilmore 50 days ago

For those who have capable enough hardware, it's effectively free to run subtasks on-device (just the marginal cost of additional electricity).

With Google's most recent 12B-parameter Gemma model, even Mac users with just 16 GB of unified memory can offload some tasks on-device.
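A quick sanity check on why a 12B-parameter model can fit in 16 GB: the weights alone take roughly parameters times bits per weight divided by 8 bytes. The quantization levels below are assumptions (the comment does not say which one is used), and the estimate ignores the KV cache, activations, and whatever else the machine is running.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


# 12B parameters at a few common precisions (which one applies is an assumption).
at_16_bit = weight_memory_gb(12, 16)
at_8_bit = weight_memory_gb(12, 8)
at_4_bit = weight_memory_gb(12, 4)
```

Under these assumptions the weights need about 24 GB at 16-bit, 12 GB at 8-bit, and 6 GB at 4-bit, so on a 16 GB machine some form of quantization is what leaves headroom for the OS and the rest of the workload.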
