The article listed explains how to avoid this. If you naively turn it loose on a big code base, yes, you’ll burn a lot of tokens while it tries to find stuff.
This is such a shame; finding where stuff is in a large codebase is my number 1 use for LLMs. I hate that it relies on grep so much. I can do grep better and faster myself.
I mean, it doesn't have to be a shame. I typically will have it start making an index as it probes, so the big token burn is upfront, and future searches are more targeted. It is pretty good at doing this.
If I set up a regular expression as a watcher on a filesystem to notify me when any file changes, write that in Go, and put the rules in a file (as regexes), then, assuming neither the regular expression nor its implementation is buggy, there's a snowball's chance in hell that it would misnotify or miscategorize anything.
Are LLMs that super reliable in their output already with all the guardrails around?
Don't think so. Hence it is snake oil just like dozens of harnesses.
It might behave differently than specified and a human is required to validate every output carefully or else.
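To make the determinism point concrete, here is a minimal sketch of the rule-matching half of such a watcher. It is written in C with POSIX `regex.h` to match the other snippet in this thread, not the Go the comment mentions. The rule table, the category names, and the `categorize` function are all hypothetical, and the actual change notification is left out entirely.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical rule: a POSIX extended regex and the category it assigns. */
struct rule {
    const char *pattern;
    const char *category;
};

/* Illustrative rules only; a real setup would load these from a file. */
static const struct rule RULES[] = {
    { "\\.go$",        "source" },
    { "\\.(md|txt)$",  "docs"   },
    { "(^|/)\\.git/",  "vcs"    },
};

/* Returns the category of the first rule whose pattern matches path,
   or "unmatched" if none does. Same input, same answer, every time. */
const char *categorize(const char *path) {
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        regex_t re;
        if (regcomp(&re, RULES[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue; /* skip a rule that fails to compile */
        int matched = regexec(&re, path, 0, NULL, 0) == 0;
        regfree(&re);
        if (matched)
            return RULES[i].category;
    }
    return "unmatched";
}
```

Given the same path and the same rules, `categorize` returns the same category on every call, which is the property the comment is contrasting with LLM output.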
> Are LLMs that super reliable in their output already with all the guardrails around?
Well, what is your definition of "super reliable in the output", and is it a quantifiable/measurable target or just a feeling?
Is it "more than humans", "more than senior developers", "almost perfect", "perfect"?
> It might behave differently than specified and a human is required to validate every output carefully or else.
Sure, just like meatbag developers. All the security flaws AI finds today were introduced years or decades ago by humans and haven't been found (that we know of) by humans in ages.
```
printf("I'll count up to %d", MAX_COUNT);
for(int i=1; < MAX_COUNT; i++)
printf("I'm now counting %d", i);
```
And of the following prompt:
```
You'll count to 10,000. At the start say "I'll count up to 10,000" and then for each number say "I'm now counting <number>" and do not say anything else. Do not miss numbers in between.
```
Which one is going to produce 100% correct results out of 10,000 runs of each?
Now don't give me "these are different tools". We all know. I'm talking about reliability and predictability.
Well, for starters the program you wrote is wrong (very unreliable) 100% of the time (very predictable)... so you just got your answer I guess.
In any case, most, if not nearly all, of the top-100 LLMs will answer your prompt with some code that does what you intended the first program to do. Only they'll actually code it properly, of course.
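For reference, a minimal sketch of what the first program appears to intend, assuming `MAX_COUNT` is 10000 (taken from the prompt), that the count is inclusive, and that each message goes on its own line. The `count_up` helper and its return value are my own additions for illustration.

```c
#include <stdio.h>

#define MAX_COUNT 10000 /* assumed from the prompt's "10,000" */

/* Writes the announcement, then one line per number in 1..max_count
   (inclusive). Returns the number of lines written, or -1 on a write error. */
int count_up(FILE *out, int max_count) {
    int lines = 0;

    if (fprintf(out, "I'll count up to %d\n", max_count) < 0)
        return -1;
    lines++;

    /* Note the loop variable in the condition and the inclusive bound. */
    for (int i = 1; i <= max_count; i++) {
        if (fprintf(out, "I'm now counting %d\n", i) < 0)
            return -1;
        lines++;
    }
    return lines;
}
```

Calling `count_up(stdout, MAX_COUNT)` from `main` writes 10,001 lines: the announcement plus one line for each number from 1 to 10,000.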