| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by wavemode 114 days ago

> the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).

It's neat. I wonder about what other kinds of tasks we could improve AI performance at by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).

3 comments

jjmarr 114 days ago

I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.

We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).

So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

"scaling time" on the other hand is useless. You can just divide the problem with subagents until it's time within a few minutes because that also increases quality due to less context/more focus.

aktau 114 days ago

Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.

> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

Sure, you could make multiple LLM invocations (different temporature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?

[1]: I suppose you could shard that out for as much compute you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.

DustinKlent 114 days ago

I have attempted to replicate the "workflow" LLM process where several LLMs come up with different variations of a way to solve a problem and a "judge" LLM reviews them and the go through different verification processes to see if this workflow increased the accuracy of the LLM's ability to solve the problem. For me, in my experiments, it didn't really make much difference but at the time I was using LLMs significantly dumber than current frontier models. HOWEVER...When I enable "Thinking Mode" on frontier LLM's like ChatGPT it DOES tend to solve problems that the non-thinking mode isn't able to solve so perhaps it's just a matter of throwing enough iterations at it for the LLM to be able to solve a particular complex problem.

ivansavz 113 days ago

> But how does one separate the good comments from the bad comments?

One thing that works very well for me (in a different context) is to ask to return two lists:

- Things that I must absolutely fix (bugs, typos, logic mistakes, etc.)

- Lesser fixes and other stylistic improvements

Then I look only at the first list.

jjmarr 113 days ago

You need human alignment on what constitutes a "good" comment. That means consistent rules.

Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.

> But how does one separate the good comments from the bad comments?

If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.

smallpipe 114 days ago

> This has completely replaced human code review for anything that isn't functional correctness

Isn’t functional correctness pretty much the only thing that matters though?

grey-area 114 days ago

Well no, style is important too for humans when they read a codebase, so the LLMs the parent is running clearly have some value for them.

They're not claiming LLMs solved every problem, just that they made life easier by taking care of busywork that humans would otherwise be doing. I think personally this is quite a good use for them - offering suggestions on PRs say, as long as humans still review them as well.

1718627440 114 days ago

But isn't style already achievable by running e.g. GNU indent?

jjmarr 113 days ago

Some examples of complex transformations linters can't catch:

* Function names must start with a verb.

* Use standard algorithms instead of for loops.

* Refactor your code to use IIFEs to make variables constexpr.

The verb one is the best example. Since we work adjacent to hardware, people like creating functions on structs representing register state called "REGISTER_XYZ_FIELD_BIT_1()" and you can't tell if this gets the value of the first field bit or sets something called field bit to 1.

If you rename it to `getRegisterXyzFieldBit1()` or `setRegisterXyzFieldBitTo1()` at least it becomes clear what they're doing.

Kim_Bruning 114 days ago

> You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

I made a cursed CPU in the game 'Turing Complete'; and had an older version of claude build me an assembler for it?

Good luck finding THAT in the training data. :-P

(just to be sure, I then had it write actual programs in that new assembly language)

withinboredom 114 days ago

But the ideas are not 'new'. A benchmark that I use to tell me if an AI is overfitted is to present the AI with a recent paper (especially one like a paxos variant) and have it build that. If it writes general paxos instead of what the paper specified, its overfitted.

Claude 4.5: not overfitted too much -- does the right thing 6/10 times.

Claude 4.6: overfitted -- does the right thing 2/10 times.

OpenAI 5.3: overfitted -- does the right thing 3/10 times.

These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.

My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.

Kim_Bruning 114 days ago

Could also be that the model has stronger priors wrt Paxos (and thus has Opinions on what good Paxos should look like)

At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.

withinboredom 114 days ago

I was just using paxos as an example. Any paper will do.

selridge 114 days ago

>You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.

This is just as stuck in a moment in time as "they only do next word prediction" What does this even mean anymore? Are we supposed to believe that a review of this paper that wasn't written when that model (It's putatively not an "LLM", but IDK enough about it to be pushy there) was trained? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames which were barely true when they took hold. Some new shit is afoot.

wavemode 114 days ago

Statistical models generalize. If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since, the author himself admits that he doesn't agree with all the LLM's comments.)

I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.

kristiandupont 114 days ago

I had Claude help me get a program written for Linux to compile on macOS. The program is written in a programming language the author invented for the project, a pretty unusual one (for example, it allows spaces in variable names).

Claude figured out how the language worked and debugged segfaults until the compiler compiled, and then until the program did. That might not be magic, but it shows a level of sophistication where referring to “statistics” is about as meaningful as describing a person as the statistics of electrical impulses between neurons.

compass_copium 114 days ago

But the programming language has explicitly laid out rules. It was not trained on those sets of rules, but it was trained on many trillions of lines of code. It has a map of how programs work, and an explanation of this new language. It's using training data and data it's fed to generate that result.

selridge 114 days ago

What doesn't that explain tho?

What behavior would you need to see for that explanation to no longer hold? Because it seems like it explains too much.

BobaFloutist 113 days ago

I don't know how you'd prompt this, but if there was a clean example of an A.I. coming up with an idea that's completely novel in more than details, it would be compelling evidence that these next-token predictors have some weird emergent properties that don't necessarily follow from intricate, sophisticated webs of token-prediction.

E.g. "What might be a room-temperature superconductor" -> "some plausible iteration on existing high-temperature superconductors based on our current understanding of the underlying physics" would not be outside how we currently understand them.

"What might be a room-temperature superconductor?" -> "some completely outlandish material that nobody has studied before and, when examined, seems to have higher temperature superconducting than we would predict" would provoke some serious questions.

A fun experiment I've heard suggested is training a model on all scientific understanding just up to some counterintuitive quantum leap in scientific understanding, say, Einstein's theory of relativity, and then seeing if you can prompt it to "discover" or "invent" said leap, without explicitly telling it what to look for. This would of course be pretty hard to prove, but if you could get it to work on a local model, publish the training set and parameters so that anyone can replicate it on their own machine, that could be pretty darn compelling.

compass_copium 113 days ago

Programs are fundamentally lists of instructions. LLMs are very good at building these lists. That it performs well when you say "Build a list you've seen before, but do it in a slightly different way this time. Here's the exact way I want you to do it." is not surprising. I would honestly be surprised if it couldn't do it.

As the other commenter suggested, a genuinely novel scientific idea would be surprising. A new style of art (think Picasso or Pollack coming along), not just an iteration on Ghibli, would be surprising. That's actual creativity.

orf 114 days ago

That’s still over-general to the point of being useless.

What you wrote would apply to a human approaching this task as well, sans the “many trillion lines of code”.

c22 114 days ago

> If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

This is an interesting claim to me. Are there any models that exist that have been trained with a (single digit) number omitted from the training data?

If such a model does exist, how does it represent the answer? (What symbol does it use for the '7'?)

wavemode 114 days ago

When I say "model" here I'm referring to any statistical model (in this example, probably linear regression). Not specifically large language models / neural networks.

c22 114 days ago

Gotcha, I don't think I know enough about it. What constitutes training data for a for a (non neural network) statistical model? Is this something I could play around with myself with pen and paper?

nairboon 114 days ago

Just the raw numbers? You list the y's and the x's and the model is approximating y=f(x) from the above example. You can totally do it with pen and paper. This is what it'd look like (for linear regression): https://observablehq.com/@yizhe-ang/interactive-visualizatio...

heavyset_go 114 days ago

You can write an f(x) and record the input and output and that can be your training data. Or just download some time-series data or something.

Kim_Bruning 114 days ago

If you run an LLM in an autoregressive loop you can get it to emulate a turing machine though. That sort of changes the complexity class of the system just a touch. 'Just predicts the next word' hits different when the loop is doing general computation.

Took me a bit of messing around, but try to write out each state sequentially, with a check step between each.

ainch 114 days ago

Sorry but this is famously not true! There is no guarantee that statistical models generalise. In your example, whether or not your model generalises depends entirely on what f(x) you use - depending on the complexity of your function class f(x+2) could be 7, 8, or -500.

One of the surprises of deep learning is that it can, sometimes, defy prior statistical learning theory to generalise, but this is still poorly understood. Concepts like grokking, double descent, and the implicit bias of gradient descent are driving a lot of new research into the underlying dynamics of deep learning. But I'd say it is pretty ahistoric to claim that this is obvious or trivial - decades of work studied "overfitting" and related problems where statistical models fail to generalise or even interpolate within the support of their training data.

arkh 114 days ago

I expected (and still expect) a lot from LLM with cross disciplinary research.

I think they should be the perfect tool to find methods or results in a field which look like it could be used in another field.

WithinReason 114 days ago

This might actually be a limitation of the "predict next word" approach since the network is never trained to predict a result in one field from a result in another. It might still make the connection though, but not as easily.

red75prime 114 days ago

I think the relevant question is: can a statistical model (or a transformer, in particular) generalize to general reasoning ability?

selridge 114 days ago

Ok cool cool. Instead of pretending you need to teach me, you could engage with what I'm saying or even the OP!

"I don't know how you get here from "predict the next word"" is not really so much a statement of ignorance where someone needs you to step in but a reflection that perhaps the tech is not so easily explained as that. No magic needs to be present for that to be the case.

wavemode 114 days ago

If you disagree with someone on the internet, you can just say "I disagree, and here's why". You don't have to aggressively accuse them of "not engaging" with the text.

I engaged. You just don't like what I wrote. That's okay.

selridge 114 days ago

Thanks but no thanks.

anon7725 114 days ago

“Represented in the training data” does not mean “represented as a whole in the training data”. If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space.

selridge 114 days ago

Yes. I’m saying that “it’s just in the training data” is a cognitive containment of these models which is incomplete. You can insist that’s what’s happening, but you’ll be left unable to explain what’s going on beyond truisms.

WithinReason 114 days ago

It's called "generalization":

https://en.wikipedia.org/wiki/Generalization_(learning)

selridge 113 days ago

>"If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space."

This statement (The one I was replying to) is fundamentally unbounded. There's nothing that can't be explained as a combination of "A" and "B" in "training data" because practically speaking we can express anything as such where the combination only needs to be convex along some high-dimensional semantic surface. Add on to that my scare quotes around "training data" because very few people have any practical idea of what is or isn't in there, so we can just make claims strategically. Do we need to explain a success? It was in the training data. A failure, probably not in the training data. Will anyone call us on this transparent farce? Not usually, no.

If a statement can--at will--explain everything and nothing, what's it worth?