HN Mirror


by amangsingh 81 days ago

A 500k-line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare. Right now, they're great for prompting simple sites/platforms, but they break at large enterprise repos.

If you don't have a rigid, external state machine governing the workflow, you have to brute-force reliability. That codebase bloat is likely 90% defensive programming: frustration regexes, context sanitizers, tool-retry loops, and state rollbacks, all just to stop the agent from drifting or silently breaking things.
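For readers unfamiliar with the pattern, a "tool-retry loop" with "state rollback" might look roughly like the Python sketch below. It is illustrative only: `run_with_retries`, `call_tool`, and `validate` are hypothetical names, not code from the leaked codebase, and the tool call is a plain callable standing in for an LLM-driven action.

```python
import copy
from typing import Callable


def run_with_retries(
    state: dict,
    call_tool: Callable[[dict], dict],
    validate: Callable[[dict], bool],
    max_attempts: int = 3,
) -> dict:
    """Apply a tool call to a snapshot of the state, retrying on failure.

    Each attempt works on a deep copy, so a bad attempt is simply
    discarded (the "rollback"). Returns the new state on success, or the
    untouched original state if every attempt fails.
    """
    for _ in range(max_attempts):
        snapshot = copy.deepcopy(state)  # isolate this attempt
        try:
            candidate = call_tool(snapshot)
        except Exception:
            continue  # treat a crashed tool call like a failed one
        if validate(candidate):
            return candidate  # commit: only validated output escapes
    return state  # every attempt failed: fall back to the original state
```

Every one of these wrappers is code that exists only because the model's output can't be trusted on the first try, which is the commenter's point about where the line count goes.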

The visual map is great, but from an architectural perspective, we're still herding cats with massive code volume instead of actually governing the agents at the system level.

18 comments

ttcbj 80 days ago

I find it really strange that there is so much negative commentary on the _code_, but so little commentary on the core architecture.

My takeaway from looking at the tool list is that they got the fundamental architecture right - try to create a very simple and general set of tools on the client-side (e.g. read file, output rich text, etc) so that the server can innovate rapidly without revving the client (and also so that if, say, the source code leaks, none of the secret sauce does).
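A rough sketch of what a "very simple and general set of tools on the client-side" could look like, assuming the server sends tool calls as JSON of the form `{"tool": ..., "args": ...}`. The tool names, message shape, and `dispatch` helper are hypothetical and not Claude Code's actual protocol. The point is that the client exposes only dumb primitives, so new behavior can ship server-side without touching this code.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical client-side tool table: a few small, general primitives.
# Deciding when to call them, and with what arguments, is assumed to
# happen on the server.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function as a client-side tool under `name`."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("read_file")
def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


@tool("write_file")
def write_file(path: str, content: str) -> str:
    Path(path).write_text(content, encoding="utf-8")
    return f"wrote {len(content)} chars to {path}"


def dispatch(message: str) -> str:
    """Execute one tool call sent by the server and return a JSON reply."""
    call = json.loads(message)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['tool']!r}"})
    return json.dumps({"result": fn(**call.get("args", {}))})
```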

Overall, when I see this I think they are focused on the right issues, and I think their tool list looks pretty simple/elegant/general. I picture the server team constantly thinking: we have these client-side tools/APIs, how can we use them optimally? How can we get more out of them? That is where the secret sauce lives.

olejorgenb 80 days ago

The tools were mostly already known, no? (I wish they had a "present" tool that let the model copy-paste from files/context/etc., showing the user some content without forcing it through the model.)

AnotherGoodName 80 days ago

Yeah, in fact one thing Claude is freaking great at is decompilation.

If you can download it client-side, you can likely place a copy in a folder and ask Claude:

‘Decompile the app in this folder to answer further questions on how it works. As an example, first question: explain what happens when a user does X.’

I do this with obscure video games where I want a guide on how some mechanics work, e.g. https://pastes.io/jagged-all-69136 as a result of a session.

It can ruin some games, but despite the possibility of hallucinations I find it waaay more reliable than random internet answers.

Works for apps too. Obfuscation doesn’t seem to stop it.

nashadelic 79 days ago

Whoa, when did they come out with JA3?

acedTrex 80 days ago

> but so little commentary on the core architecture.

The core architecture is not interesting? It's an LLM TUI; there's not much there to discuss architecturally. The code itself is the actually fascinating train wreck to look at.

jayd16 80 days ago

Why are "tools" for local IO interesting and not just the only way to do it? I can't really imagine a server architecture that gets to read your local files and present them without a fat client of some kind.

What is the naive implementation you're comparing against? SSH access to the client machine?

abossy 80 days ago

It's early days and we don't fully understand LLM behavior to the extent that we can assume questions like this about agent design are resolved. For instance, is an agent smarter with Claude Code's tools or `exec_command` like Codex? And does that remain true for each subsequent model release?

woodson 80 days ago

It’s a distinction that IMHO likely doesn’t make much difference, at least for the mostly automated/non-interactive coding agent use case. What matters more is how well the post-training on synthetic harness traces works.

sunir 80 days ago

It’s not surprising. There has been quite a bit of industrial research into how to get mere apes to behave deterministically with huge software control systems, and they are an unruly bunch, I assure you.

RALaBarge 80 days ago

Sunir! Hope you are doing well man, I got a good chuckle from this.

sunir 80 days ago

I am! I’ll reach out in another channel to connect.

comboy 80 days ago

It's hard to tell how much it says about difficulty of harnessing vs how much it says about difficulty of maintaining a clean and not bloated codebase when coding with AI.

amangsingh 80 days ago

Why not both? AI writes bloated spaghetti by default. The control plane needs to be human-written and rigid, at least until the state machine is solid enough to dogfood itself. Then you can safely let the AI enhance the harness from within the sandbox.

whiplash451 80 days ago

Were human organizations (not individuals) any good at the latter anyway?

chrismarlow9 80 days ago

We propped the entire economy up on it. Just look at the S&P 500's top 10. Actually, even its top 50 holdings.

If it doesn't deliver on the promise, we have bigger problems than "oh no, the code is insecure". We went from "I think this will work" to "this has to work, because if it doesn't we have one of those 'you owe the bank a billion dollars' situations".

jayd16 80 days ago

It's weird to look at the world like this. If they deliver doesn't that invalidate thousands of other business plans? What about paying for that?

If they fail, doesn't software and the giant companies that make it go back to owning the world?

xp84 80 days ago

“if they deliver”

As I’m reading this, I’m thinking about how in 1980 it was imagined that everyone needed to learn how to program in BASIC or COBOL, and that the way computers would become ubiquitous would be that everybody would be writing programs for them. That turned out to be a quaint and optimistic idea.

It seems like the pitch today is that every company that has a software-like need will be able to use AI to manifest that software into existence, or more generally, to manifest some kind of custom solution into existence. I don’t buy it. Coding the software has never been the true bottleneck, anyone who’s done a hackathon project knows that part can be done quickly. It’s the specifying and the maintenance that is the hard part.

To me, the only way this will actually bear the fruit it’s promising is if they can deliver essentially AGI in a box. A company will pay to rent some units of compute that they can speak to like a person and describe the needs, and it will do anything - solve any problem - a remote worker could do. IF this is delivered, indeed it does invalidate virtually all business models overnight, as whoever hits AGI will price this rental X%[1] below what it would cost to hire humans for similar work, breaking capitalism entirely.

[1] X = 80% below on day 1 as they’ll be so flush with VC cash, and they’d plan to raise the price later. Of course, society will collapse before then because of said breaking of capitalism itself.

kubanczyk 80 days ago

> breaking capitalism

It seems like a non sequitur. This hypothetical scenario sounds like entrenching capitalism, because it would concentrate capital even more.

It would probably weaken democracy and weaken the free market (esp. the job market), yes.

> society will collapse before then because of said breaking of capitalism itself

Or, maybe the society would continue to exist with even more inequality? And, of course, much changed from what it is today.

xp84 78 days ago

I suppose it depends on your perspective. I guess I mean broken kind of in the gaming sense, where a gameplay mechanic is 'broken' if you can exploit it to completely subvert the entire intended way it's supposed to work.

You could argue that capitalism was very not broken in 1960, when you could get a job at 18 selling shoes, driving a cab, or delivering milk or whatever, and support a family of five on your salary, save for retirement, and go on yearly vacations.

It's arguably somewhat broken today, when (*gestures around*) things are like this.

I'd say it would be entirely broken if AGI means a few hundred billionaires who have ownership stakes in an AI company simply capture all the wealth in the world while most of the rest starve, but the robots help you put down the peasant uprisings and farm and raise crops for you.

I agree with you though that technically, capitalism will still be 'going strong' unless the peasants are able to overpower the AI robot billionaire industrial complex and burn it all down.

pred_ 80 days ago

The time is ripe for deterministic AI; incidentally, this was also released today: https://itsid.cloud/ - presumably will be useful for anyone who wants to quickly recreate an open source Python package or other copyrighted work to change its license.

nyrikki 80 days ago

Can you please explain the use here? I tried the demo, and cat, cp, echo, etc... seem to do the exact same thing without the cost.

Their demo even says:

   `Paste any code or text below. Our model will produce an AI-generated, byte-for-byte identical output.`

Unless this is a parody site can you explain what I am missing here?

Token echoing doesn't even reach the lexeme/pattern level, and isn't even close to WSD, Ogden's Lemma, symbol-grounding, etc...

The intentionally "probably approximately correct" statistical learning model fundamentally limits reproducibility for PAC/statistical methods like transformers.

Inherent ambiguity of CFGs == Post correspondence problem == halting problem == open-domain frame problem == system identification problem == symbol-grounding problem == Entscheidungsproblem

The only way to get around that is to construct a grammar that isn't ambiguous. It will never exist for CFGs, programs, types, etc... with arbitrary input.

I just don't see why placing a `14-billion parameter identity transformer` that just basically echos tokens is a step forward on what makes the problem hard.

Please help me understand.

yw3410 80 days ago

It's satire - just see the About page.

ericfr11 80 days ago

April Fools'. Check the careers page.

BloondAndDoom 80 days ago

I don’t understand what this is, is it satire? What is it supposed to be doing or solving?

climclam 80 days ago

Take a look at the demo or about page ;)

edit: or click 'Start Pro Trial'

BloondAndDoom 79 days ago

The tech world has become so wild that even on a topic I’m confident about, I can’t tell whether something is real or satire. The sheer number of real but absolutely idiotic landing pages made me this way :)

nicoburns 80 days ago

Kinda depends how much of it is vibe coded. It could easily be 5x larger than it needs to be just because the LLM felt like it if they've not been careful.

saynay 80 days ago

Claude folks proudly claim to have Claude effectively writing itself. The CEO claims it will read an issue and automatically write a fix, tests, commit and submit a PR for it.

amangsingh 80 days ago

Bingo. And them 'being careful' is exactly what bloats it to 500k lines. It's a ton of on-the-fly prompt engineering, context sanitizers, and probabilistic guardrails just to keep the vibes in check.

whycombagator 80 days ago

> Right now, they're great for prompting simple sites/platforms but they break at large enterprise repos

Can you expand on this?

My experience is that they require excessive steering but do not “break”.

oblio 80 days ago

I think the "breakage" is in terms of conciseness and compactness, not outright brokenness.

Like that drunk uncle that takes half an hour and 20 000 words to tell you a 500 word story.

xp84 80 days ago

Indeed. In some ways, this is just kind of an extrapolation of the overall trend toward extreme bloat that we’ve seen in the past 15 years, just accelerated because LLMs code a lot faster. I’m pretty accustomed to dealing with Web application code bases that are 6-10 years old, where the hacks have piled up on top of other hacks, piled on top of early, tough-to-reverse bad decisions and assumptions, and nobody has had time to go back and do major refactors. This just seems like more of the same, except now you can create a 10-year-old hack-filled code base in three hours.

jessai202699 80 days ago

The terrifying thing is that LLMs turn "technical debt" into "synthetic debt" that accumulates in real-time.

When we use an agent that lacks a native way to consolidate its own context, we essentially force it to generate these 10-year-old hack-filled codebases by design. We’re over-engineering the "container" (the CLI logic) to babysit a "leaky" context.

If the architecture doesn't start treating long-term memory as a first-class citizen, we’re just going to see more of these 500k-line "safety nets" masking the underlying fragility of the agents.

cheesecompiler 80 days ago

There seem to be multiple mechanisms compensating for imperfect, lossy memory. "Dreaming" is another band-aid on inability to reliably store memory without loss of precision. How lossy is this pruning process?

It's one thing to give Claude a narrow task with clear parameters, and another to watch errors or incorrect assumptions snowball as you have a more complex conversation or open-ended task.

tracyhenry 80 days ago

> they break at large enterprise repos.

I don't know where you get this. You should ask folks at Meta. They are probably the biggest and happiest users of CC.

batshit_beaver 80 days ago

You mean the company where engineers ask chat bots to write chess games in their spare time in order to hit their AI usage requirements? That Meta?

tracyhenry 80 days ago

idk why you bring this up. This is irrelevant to whether CC actually works at big corps.

jimbokun 80 days ago

I missed that, source?

pancsta 80 days ago

You need state-oriented programming to handle that. I know, because I made one. The keyword is "unpredictability". Embrace nondeterminism.

bogdanoff_2 80 days ago

What do you mean by "actually governing the agents at the system level", and how is it different from "herding cats"?

amangsingh 80 days ago

Herding cats is treating the LLM's context window as your state machine. You're constantly prompt-engineering it to remember the rules, hoping it doesn't hallucinate or silently drop constraints over a long session.

System-level governance means the LLM is completely stripped of orchestration rights. It becomes a stateless, untrusted function. The state lives in a rigid, external database (like SQLite). The database dictates the workflow, hands the LLM a highly constrained task, and runs external validation on the output before the state is ever allowed to advance. The LLM cannot unilaterally decide a task is done.
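A minimal Python sketch of that idea, assuming a single SQLite table and a two-state `pending -> done` workflow. The schema and the names `init_db`, `advance`, `llm`, and `check` are hypothetical, not the commenter's actual tool. The model is just a function from task spec to text; only the external `check` can move a task to `done`, and a failed check records the attempt without advancing the state.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    """Create the task table. The LLM never gets direct access to it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " spec TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " output TEXT)"
    )


def advance(
    conn: sqlite3.Connection,
    task_id: int,
    llm: Callable[[str], str],
    check: Callable[[str, str], bool],
) -> str:
    """Run one workflow step for a task and return its resulting state."""
    row = conn.execute(
        "SELECT spec, state FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no task with id {task_id}")
    spec, state = row
    if state != "pending":
        return state  # the database, not the model, decides what runs next
    output = llm(spec)  # stateless call: the model only sees the spec
    with conn:  # commit the update atomically
        if check(spec, output):  # external validation gate
            conn.execute(
                "UPDATE tasks SET state = 'done', output = ? WHERE id = ?",
                (output, task_id),
            )
            return "done"
        conn.execute(
            "UPDATE tasks SET attempts = attempts + 1 WHERE id = ?",
            (task_id,),
        )
        return "pending"
```

In practice `llm` would wrap a model API call and `check` might run a test suite or linter; the model can claim anything it likes, but only `check` can change the row.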

I got so frustrated with the former while working on a complex project that I paused it to build a CLI to enforce the latter. Planning to drop a Show HN for it later today, actually.

skeledrew 80 days ago

> The database dictates the workflow, hands the LLM a highly constrained task, and runs external validation on the output before the state is ever allowed to advance.

This sounds like where lat.md[0] is headed. Only thing is it doesn't do task constraint. Generally I find the path these tools are taking interesting.

[0] https://github.com/1st1/lat.md

amangsingh 80 days ago

I looked into lat.md. They are definitely thinking in the same direction by using a CLI layer to govern the agent.

The key difference is the state mechanism. They use markdown; I use an AES-encrypted SQLite database.

Markdown is still just text an LLM can hallucinate over or ignore. A database behind a compiled binary acts as a physical constraint; the agent literally cannot advance a task without satisfying the cryptographic gates.
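The thread doesn't say how these "cryptographic gates" work. One plausible reading, sketched below under that assumption, is an HMAC token: the validator holds a secret key that never enters the model's context, signs `(task_id, new_state)` only after a check passes, and the state store rejects any transition without a valid signature. `GATE_KEY`, `issue_token`, and `accept_transition` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Hypothetical: the key lives only in the validator/binary and is never
# placed in the LLM's context, so the model cannot mint its own tokens.
GATE_KEY = secrets.token_bytes(32)


def issue_token(task_id: int, new_state: str, key: bytes = GATE_KEY) -> str:
    """Called by the external validator, only after a check has passed."""
    msg = f"{task_id}:{new_state}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def accept_transition(
    task_id: int, new_state: str, token: str, key: bytes = GATE_KEY
) -> bool:
    """The state store applies a transition only if its token is valid."""
    expected = issue_token(task_id, new_state, key)
    return hmac.compare_digest(expected, token)  # constant-time comparison
```

A gate like this only binds the model if it genuinely cannot read the key (from the environment, disk, or process memory), so it is as strong as that isolation.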

I just dropped the Show HN for it here if you want to check out the architecture: https://news.ycombinator.com/item?id=47601608

mywacaday 80 days ago

I started that very personal project on Monday. Waiting with bated breath; make sure to add a "buy me a coffee" sponsor link.

amangsingh 80 days ago

Just posted it here: https://news.ycombinator.com/item?id=47601608 Thank you so much for the coffee offer, that genuinely made my day! I don't have a sponsor link set up. Honestly, the best support is just hearing if this actually helps you ship your personal project faster without losing your mind to prompt engineering. I really hope it gives you your sanity back. Let me know how it goes!

Melatonic 80 days ago

Oddly enough, some of your comments have already been marked as "dead", even though they just seemed like normal comments explaining your rationale.

edit: It also seems like people's replies are getting downvoted to hell, marked as dead, and disappearing. Someone must not like your idea :-)

zargon 80 days ago

Comments are marked dead by automatic processes, not through downvotes. They're dead before anyone sees them, and you can't vote on a dead comment. amangsingh's comments have probably triggered some automated moderation. Probably at least partially because they sound LLM-generated.

fallinditch 80 days ago

Sounds good, I'll keep an eye out.

amangsingh 80 days ago

Just dropped the Show HN here: https://news.ycombinator.com/item?id=47601608. Would love to hear your thoughts on the architecture!

skeledrew 80 days ago

There's nothing at that link. Not even a title.

Melatonic 80 days ago

Looks like it was downvoted to hell and marked as dead super fast. I leave the flag for "dead" on in my HN settings (it shows dead comments super desaturated), and this seems unusual.

marcuscog 80 days ago

I think these folks are attempting to build systems with IAM, entity states, business rules: all built over two foundational DSLs - https://typmo.com

mbesto 80 days ago

Thousands of developers are using Claude Code successfully (I think?).

So what specifically is the gripe? If it works, it works right?

ap99 80 days ago

So this is more like an art than science - and Claude Code happens to be the best at this messy art (imo).

p-e-w 80 days ago

> A 500k line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare.

Considering what the entire system ends up being capable of, 500k lines is about 0.001% of what I would have expected something like that to require 10 years ago.

You can combine that with all the training and inference code, and at the end of the day, a system that literally writes code ends up being smaller than the LibreOffice codebase.

It boggles the mind, really.

davidkunz 80 days ago

Oh, you should have a look at Pi then.

https://github.com/badlogic/pi-mono/tree/main/packages/codin...

sarchertech 80 days ago

> You can combine that with all the training and inference code, and at the end of the day, a system that literally writes code ends up being smaller than the LibreOffice codebase.

You really need to compare it to the model weights though. That’s the “code”.

pixl97 80 days ago

>You really need to compare it to the model weights though

Then you'd need to compare the education of any developer in relation to how many LOC their IDE is. That's the "code".

So yeah, the analogy doesn't make a whole lot of sense.

oblio 80 days ago

It even wrote an entire browser!

By "just" wrapping a browser engine.

raincole 80 days ago

... what are you even talking about? "The system that literally writes code" has a few hundred trillion parameters. How is this smaller than LibreOffice?

I know xkcd 1053, but come on.

bwfan123 80 days ago

brute-forcing pattern-matching at scale. These are brittle systems with enormous duct-taping to hold everything together. workarounds on workarounds.

ramesh31 80 days ago

>A 500k line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare. Right now, they're great for prompting simple sites/platforms but they break at large enterprise repos.

Is that the case? I'm pretty sure Claude Code is one of the most massively successful pieces of software made in the last decade. I don't know how that proves your point. Will this codebase become unmanageable eventually? Maybe, but literally every agent harness out there is just copying their lead at this point.

amangsingh 80 days ago

Claude Code is a massively successful generator (I use it all the time), but it's not a governance layer.

The fact that the industry is copying a 500k-line harness is the problem. We're automating security vulnerabilities at scale because people are trying to put the guardrails inside the probabilistic code instead of strictly above it.

Standardizing on half a million lines of defensive spaghetti is a huge liability.

ramesh31 80 days ago

>Standardizing on half a million lines of defensive spaghetti is a huge liability.

Again, maybe it will be. Or maybe the way we make software and what is considered good practice will completely change with this new technology. I'm betting on the latter at this point.