| HN Mirror


	by SatvikBeri 30 days ago
	The context window has nothing to do with RAM usage and even if it did, a million tokens of context is maybe 5mb.

bluegatty 30 days ago

'A million tokens of context' is literally terabytes of KV cache VRAM on very expensive Nvidia silicon, on the model side.

On the Agent, yes, the context window does relate to RAM, because the 'entire conversational history' is generally kept in memory. So ballpark 1M 'words' across a bunch of strings. It's not really that much.
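As a rough sketch of the agent-side number, the raw text for a million tokens can be estimated like this. The figure of about 4 characters per token is an assumed rule of thumb for English text, not something stated in this thread, and `text_bytes` is a hypothetical helper name.

```python
def text_bytes(tokens, chars_per_token=4, bytes_per_char=1):
    """Approximate size of the raw conversation text held by the agent.

    chars_per_token=4 is an assumed average for English text;
    bytes_per_char is 1 for ASCII-like storage, 2 for UTF-16-like storage.
    """
    return tokens * chars_per_token * bytes_per_char


one_million = 1_000_000
print(text_bytes(one_million) / 1e6, "MB with 1-byte characters")
print(text_bytes(one_million, bytes_per_char=2) / 1e6, "MB with 2-byte characters")
```

Under those assumptions the text itself lands in the single-digit megabyte range, before any per-string or per-object overhead from the runtime.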

Claude Code is not inefficient because 'it's not Rust' - it's just probably not very efficiently designed.

Rust does not bestow magical properties that make memory more efficient really.

A bit more, but it's not going to change this situation.

'Doing it in Rust' might yield amazing returns just because the very nature of the activity is 'optimization'.

rixed 30 days ago

Rust "denialism" is as annoying as rust evangelism.

Of course any seemingly idiomatic rust is going to run circles around TS transpiled into JIT-compiled JS.

bluegatty 30 days ago

Lamenting any 'not even criticism' of Rust as 'denialism' is just evidence of the insane cult that is Rust.

Rebuilding Claude Code in Rust will make almost no difference in terms of real world performance. V8 is 'relatively fast', and there wouldn't be any noticeable improvements there, and probably not memory footprint either.

The source for Claude Code was leaked and it's a vibe-coded mess. There's not much thought given to clean architecture, and it's unlikely they've cleaned it up or given much thought to memory consumption etc. If they did, they'd get by far most of the way there and likely obviate any real want to 'do it in Rust', unless there are other architectural considerations.

imtringued 30 days ago

You're the delusional one for bringing up the memory usage of the inference server that clearly isn't running inside the coding agent.

The problem with your comments is that you're showing off a fundamental lack of understanding of the difference between managed languages and unmanaged languages.

The vast majority of GCs are optimized for throughput and allocate big chunks of memory. They also tend to never release it if there was a temporary memory spike. The most advanced GCs also tend to have either read or write barriers, which slow down basic object accesses.

Just in time compilation and managed languages in general need to retain a runtime representation of the source code to perform JIT compilation and then they have to store the compiled code in memory as well.

JavaScript uses references against dynamic objects, which means you have to pay the indirection cost of a pointer but you also need to store type information as well to monomorphize the object literals and classes at runtime and fall back to a regular hashmap when fields are added dynamically.

All of these things will add up and increase the amount of memory the application uses and how slow it runs.

Sure Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM, but if those were not there you could easily build a C++ based alternative that runs circles around a hypothetical JavaScript based Claude Code that got its act together.

bluegatty 29 days ago

1) I'm not 'delusional' for bringing up 'What Memory is Used Where' - I'm clarifying for the people who seem a bit confused (see above) as to 'where the context lives' - and trying to provide a simple mental model for that.

That's the opposite of delusional.

It's just information.

Attacking people for anything 'Rust related' however - is the quintessential reason why everyone hates the Rust community.

2) 'The problem with your comment' is that it's presumptive and arrogant - as if I 'don't know the difference between managed and unmanaged languages'.

I've been writing software since 1990.

Embedded (on custom silicon), UI, SaaS, backend. Some embedded work I did almost 30 years ago is still in production today.

I've written a scripting language (for production), and a cyclic ref-count GC (didn't make it to production).

Your comments about GC etc. are fine - but they don't really offer any insight into the actual problem.

There's one critical detail, namely 'memory not released after spikes'. Yes, this is observed behaviour, but it's usually accommodated with a little bit of decent engineering.

If you're going to make the comparative basis an 'Idiomatic Rust' solution (aka good patterns), then we should make the assumption of an 'Idiomatic Node' solution for Claude Code.

3) 'The other problem with your comment' is that your conclusion is wrong - by your own hand.

Right here: "Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM," - the implication being that Claude Code does not inherently have to 'leak all that RAM' - and would run just fine with some basic work.

An 'Idiomatic Node' implementation of Claude Code wouldn't exhibit those problems, and would perform pragmatically just as well as an Idiomatic Rust implementation.

From a memory management standpoint, Rust might use significantly less memory, but a 150 MB footprint vs a 350 MB footprint for an average session is 'pragmatically immaterial'.

The difference in 'perceived performance' would be negligible - if any.

The 'cost' of writing the 'kind of program that Claude Code is' in a systems-level language would be quite a lot, for not really much benefit.

The 'Rust or C++' solution would not 'run circles' around the 'node' implementation in anything but some 'performative', inward-looking benchmarks, aka 'the worst kind of Engineering'.

Consider pondering why almost nobody writes such applications in Rust or C++.

regexorcist 30 days ago

You have a point but it's definitely not TBs for 1M. Should be more like 100G.

vlovich123 30 days ago

It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5mb.

The rough estimate is 2 * L * H_kv * D * bytes per element

Where:

* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values

The dominant factor here is typically 2 * H_kv * D since it’s usually at least 2048 bytes. Per token.

For Llama3 7B you’re looking at 128 GiB if your context is really 1M (not that that particular model supports a context so big). DeepSeek4 uses something called sparse attention so the above calculus is improved - 1M of context would use 5-10 GiB.

But regardless of the details, you’re off by several orders of magnitude.
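The estimate above can be sketched in a few lines. The shape values used here (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are assumptions chosen to resemble a small Llama-3-class model with grouped-query attention; they are not taken from this thread, and `kv_cache_bytes` is a hypothetical helper name. "1M" is taken as 2**20 tokens.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Dense KV cache size in bytes.

    Per token: 2 (keys + values) * L * H_kv * D * bytes per element.
    Ignores sparse attention, quantization, and paging overhead.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


# Assumed model shape, fp16/bf16 cache entries (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
total = kv_cache_bytes(2**20, LAYERS, KV_HEADS, HEAD_DIM)

print(per_token // 1024, "KiB per token")
print(total // 2**30, "GiB for 2**20 tokens")
```

With these assumed values the per-token cost is 128 KiB and the full 2**20-token cache is 128 GiB, which lines up with the figure quoted above and sits far from both 5 MB and multiple terabytes.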

tujux 30 days ago

Pretty sure we're talking about the output text, not the tensors.

m00x 30 days ago

These LLM replies are really getting annoying.

vlovich123 30 days ago

Mine? I literally wrote what I wrote because “context window” as a term of art refers to the LLM’s context window.

I guess get better at detecting LLMs instead of accusing everything of being an LLM reply?
