

	by Flux159 314 days ago
	Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via OpenRouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens, which are priced at $2/1M via the API, just like output tokens. The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching and reduced token pricing for cache hits, I think I would run this more, but right now it's too expensive per agent run.
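To make the cost growth concrete, here is a rough sketch in Python of what resending the whole history on every tool call does to the input-token bill. Only the $2/1M price comes from the comment above. The context sizes, the step count, and the $0.20/1M cached-token price are hypothetical numbers chosen for illustration, not Cerebras or OpenRouter pricing.

```python
from typing import Optional


def agent_run_cost(
    base_tokens: int,
    tokens_per_step: int,
    steps: int,
    price_per_million: float,
    cached_price_per_million: Optional[float] = None,
) -> float:
    """Input-token cost in USD of an agent run that resends its history.

    Request i carries base_tokens + i * tokens_per_step input tokens.
    If cached_price_per_million is given, the part of the context that
    was already sent in the previous request is billed at that
    (hypothetical) discounted rate and only the new tokens at full price.
    """
    total = 0.0
    previous_context = 0
    for i in range(steps):
        context = base_tokens + i * tokens_per_step
        if cached_price_per_million is None:
            total += context * price_per_million / 1_000_000
        else:
            fresh = context - previous_context
            total += (
                previous_context * cached_price_per_million
                + fresh * price_per_million
            ) / 1_000_000
        previous_context = context
    return total


# Hypothetical run: 8k-token starting context, 2k tokens added per tool call, 10 calls.
print(f"{agent_run_cost(8_000, 2_000, 10, 2.0):.4f}")       # 0.3400
print(f"{agent_run_cost(8_000, 2_000, 10, 2.0, 0.2):.4f}")  # 0.0808
```

Without a cache discount the billed input grows roughly quadratically with the number of tool calls, because every call pays again for everything sent before it.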

6 comments

sysmax 314 days ago

Adding entire files into the context window and letting the AI sift through it is a very wasteful approach.

It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slice the files on the symbol level.

So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:

  class MyClass {
    void foo();
    
    void bar() {
        //code
    }
  }

Any AI-suggested changes are then easy to merge back (renamings are the only notable exception), so it works really fast.
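As a rough illustration of the collapsing step described above, here is a minimal sketch that uses Python source and the standard `ast` module as a stand-in for the C++-style example. It is not how the editor mentioned in this thread works; `collapse_source` and `keep_bodies` are hypothetical names. Functions listed in `keep_bodies` keep their full definition, and every other function is reduced to its signature.

```python
import ast


class _CollapseBodies(ast.NodeTransformer):
    """Replace the body of every function not listed in keep_bodies with `...`."""

    def __init__(self, keep_bodies):
        self.keep_bodies = set(keep_bodies)

    def _collapse(self, node):
        if node.name not in self.keep_bodies:
            # Keep the signature, drop the implementation.
            node.body = [ast.Expr(value=ast.Constant(value=...))]
        return node

    visit_FunctionDef = _collapse
    visit_AsyncFunctionDef = _collapse


def collapse_source(source, keep_bodies):
    """Return source with only the functions named in keep_bodies left intact."""
    tree = _CollapseBodies(keep_bodies).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


SOURCE = '''
class MyClass:
    def foo(self):
        total = 0
        for i in range(10):
            total += i
        return total

    def bar(self):
        return self.foo() + 1
'''

# foo is reduced to a declaration, bar keeps its definition.
print(collapse_source(SOURCE, keep_bodies={"bar"}))
```

Matching on the bare function name is a simplification. A real tool would need qualified names, and `ast.unparse` drops comments and original formatting, so this only shows the shape of the idea.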

I am currently working on an editor that combines this approach with the ability to step back-and-forth between the edits, and it works really well. I absolutely love the Cerebras platform (they have a free tier directly and pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds based on single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. Also great for things like applying known algorithms to spread out data structures, where including all files would kill the context window, but pulling individual types works just fine with a fraction of tokens.

If you don't mind the shameless plug, there's more explanation of how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...

postalcoder 314 days ago

this works if your code is exceptionally well composed. anything less can lead to looney tunes levels of goofiness in behavior, especially if there’s as little as one or two lines of crucial context elsewhere in the file.

This approach saves tokens theoretically, but i find it can lead to wastefulness as it tries to figure out why things aren’t working when loading the full file would have solved the problem in a single step.

sysmax 314 days ago

It greatly depends on the type of work you are trying to delegate to the AI. If you ask it to add one entire feature at a time, file level could work better. But the time and costs go up very fast, and it's harder to review.

What works for me (adding features to huge interconnected projects) is to think about what classes, algorithms and interfaces I want to add, and then give very brief prompts like "split class into abstract base + child like this" and "add another child supporting x, y and z".

So, I still make all the key decisions myself, but I get to skip typing the most annoying and repetitive parts. Also, the code doesn't look much different from what I could have written by hand, it just gets done about 5x faster.

DrBenCarson 314 days ago

Yep and it collapses in the enterprise. The code you’re referencing might well be from some niche vendor’s bloated library with multiple incoherent abstractions, etc. Context is necessarily big

sysmax 314 days ago

Ironically, that's how I got the whole idea of symbol-level edits. I was working on a project like that, and realized that a lot of work is actually fairly small edits. But to do one right, you need to look through a bunch of classes, abstraction layers, and similar implementations, and then keep in your head how to get an instance of X from a pointer to Y, etc. Very annoying repetitive work.

I tried copy-pasting all the relevant parts into ChatGPT and gave it instructions like "add support for X to Y, similar to Z", and it got it pretty well each time. The bottleneck was really pasting things into the context window, and merging the changes back. So, I made a GUI that automated it - showed links on top of functions/classes to quickly attach them into the context window, either as just declarations, or as editable chunks.

That worked faster, but navigating to definitions and manually clicking on top of them still looked like an unnecessary step. But if you asked the model "hey, don't follow these instructions yet, just tell me which symbols you need to complete them", it would give reasonable machine-readable results. And then it's easy to look them up on the symbol level, and do the actual edit with them.

It doesn't do magic, but takes most of the effort out of getting the first draft of the edit, which you can then verify, tweak, and step through in a debugger.
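The two-step flow described above (first ask the model which symbols it needs, then look them up) could be sketched roughly as follows. This is only an illustration in Python using the standard `ast` and `json` modules. The JSON reply format with a `"symbols"` key, and the names `index_symbols` and `build_context`, are assumptions made for the example, not what the tool actually uses.

```python
import ast
import json


def index_symbols(source):
    """Map 'function' and 'Class.method' names to their source text."""
    index = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            index[node.name] = ast.get_source_segment(source, node, padded=True)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    name = f"{node.name}.{item.name}"
                    index[name] = ast.get_source_segment(source, item, padded=True)
    return index


def build_context(source, model_reply):
    """Resolve the symbols the model asked for; report the ones not found."""
    wanted = json.loads(model_reply)["symbols"]
    index = index_symbols(source)
    found = {name: index[name] for name in wanted if name in index}
    missing = [name for name in wanted if name not in index]
    return found, missing


SOURCE = '''
class Shape:
    def area(self):
        return self.width * self.height

def describe(shape):
    return f"area={shape.area()}"
'''

# Hypothetical reply to "don't follow these instructions yet, just list the symbols you need".
REPLY = '{"symbols": ["Shape.area", "render"]}'

found, missing = build_context(SOURCE, REPLY)
print(sorted(found))  # ['Shape.area']
print(missing)        # ['render']
```

The `missing` list matters in practice: a model can ask for symbols that do not exist, and the second prompt should either say so or fall back to a wider slice.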

hooo 314 days ago

Totally agree with your view on the symbolic context injection. Is this how things are done with code/dev AI right now? Like if you consider the state of the art.

seunosewa 314 days ago

They search for the token of interest, e.g. with grep -n, then they read that line and the next 50 lines or so. They continue until they get to the end.
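A minimal sketch of that search-then-read loop, written in Python rather than as shell commands. `grep_windows` is a hypothetical helper, and the window of 50 lines is just the figure from the comment above.

```python
import re


def grep_windows(text, pattern, window=50):
    """Return (line_number, lines) pairs, like `grep -n` followed by a read.

    Each pair holds the 1-based number of a matching line and that line
    plus up to `window` lines after it.
    """
    lines = text.splitlines()
    regex = re.compile(pattern)
    hits = []
    for i, line in enumerate(lines):
        if regex.search(line):
            hits.append((i + 1, lines[i : i + 1 + window]))
    return hits


# Build a 100-line file with one line of interest at line 10.
sample = [f"line {n}" for n in range(1, 101)]
sample[9] = "def target(): pass"

for number, chunk in grep_windows("\n".join(sample), r"\btarget\b", window=3):
    print(number, chunk)
# 10 ['def target(): pass', 'line 11', 'line 12', 'line 13']
```

The weakness is visible in the code: the window is a fixed line count, so it can cut a function in half or pull in unrelated code.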

seunosewa 314 days ago

The Cerebras.ai plan offers a flat fee of $50 or $200.

The API price is not a reason to reject the subscription price.

dedene 314 days ago

The flat fee is for a fixed max amount of tokens per day. Not requests, tokens.

Havoc 314 days ago

This seems to be rate limited by message not token so the lack of cache may matter less

andhuman 314 days ago

No it’s by token. The FAQ says this:

> Actual number of messages per day depends on token usage per request. Estimates based on average requests of ~8k tokens each for a median user.

https://cerebras-inference.help.usepylon.com/articles/346886...
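To see what a token-based limit means for message counts, here is a back-of-the-envelope sketch in Python. The ~8k tokens per request is the figure from the FAQ quote. The daily budget of 8M tokens is a hypothetical number (it is what ~8k tokens times the 1000 requests per day mentioned elsewhere in this thread would imply), not something the quote states.

```python
def messages_per_day(daily_token_budget: int, avg_tokens_per_request: int) -> int:
    """How many requests fit into a fixed daily token budget."""
    return daily_token_budget // avg_tokens_per_request


HYPOTHETICAL_DAILY_BUDGET = 8_000_000  # assumption, not a published figure

print(messages_per_day(HYPOTHETICAL_DAILY_BUDGET, 8_000))   # 1000
print(messages_per_day(HYPOTHETICAL_DAILY_BUDGET, 40_000))  # 200
```

Under these assumptions, an agent whose requests average 40k tokens because of resent history gets a fifth of the advertised message count.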

jtbayly 313 days ago

How did you find that? Are you sure it applies to Cerebras Code Pro or Max?

NitpickLawyer 314 days ago

Yes, but the new "thing" now is "agentic", where the driver is "tool use". So at every point where the LLM decides to make a tool call, a new request gets sent. So for a simple task where the model needs to edit one function down the tree, there might be 10 calls: the 1st with the task, 2-5 for "read_file", then the model starts writing code, 6-7 trying to run the code, 8 fixing something, and so on...

itsafarqueue 314 days ago

Yup. If you’ve ever watched a 60+ minute agent loop spawning sub agents, your “one message” prompt leaves you several hundred messages in the hole.

Flux159 314 days ago

The lack of caching causes the price to increase for each message or tool call in a chat because you need to send the entire history back after every tool call. Because there isn’t any discount for cached tokens you’re looking at very expensive chat threads.

waldrews 313 days ago

Does caching make as much sense as a cost-saving measure on Cerebras hardware as it does on mainstream GPUs? Caching should be preferred if SSD->VRAM is dramatically cheaper than recalculation. If Cerebras is optimized for massively parallel compute with fixed weights, and not a lot of memory bandwidth into or out of the big wafer, it might actually make sense to price per token without a caching discount. Could someone from the company (or otherwise familiar with it) comment on the tradeoff?

BenGosub 314 days ago

If they say it costs $50 per month, why do you need to make additional payments?

davidweatherall 314 days ago

$50 per month is their SaaS solution that lets you make 1000 requests per day. The OpenRouter cost is the raw API cost if you try to use qwen3-coder via the pay-as-you-go model when using Cline.

beastman82 314 days ago

the API price is not very relevant to this flat fee service announcement.

In fact it seems obvious that you should use the flat fee model instead