by vidarh 985 days ago
Last night I asked ChatGPT to explain to me how to build an OIDC provider using a specific Ruby gem. It gave me a solution that mostly ignored the gem. When I pointed this out, it explained that the gem didn't actually provide much that would reduce the code size, and offered up a version that used it more extensively anyway.

It used what I told it in the original case, gave me reasoning for why not using the gem much was a decent choice (and I verified that it was right), and showed with an example that it was able to reason about how my feedback related to the original answer and apply it. Later, in response to a subsequent question, it fleshed out the rest of the process. Everything it gave me worked.

To me that is a clear example that while it certainly fails to apply concepts fairly often (and often writes broken code), in other cases it does well. I'll add that this was after I'd spent some time searching for examples and found nothing like what it suggested. I was about to resign myself to a slog through a lot of really bad documentation. Searching afterwards for some of what it suggested also made it clear it did not just crib from training data.

For me, this is an example of it reasoning better about the subject than a whole lot of the people I found discussing it in forum posts, who often made mistakes the code it gave me did not, or made assumptions that the code ChatGPT gave me made clear were wrong (as I could verify from the fact it worked).

On the other hand it struggles with something as simple as addition of large numbers that a determined child could do.

Nobody will claim it's consistently reasoning well. But I also regularly see it reason better than a lot of people I know about specific subjects, and so it's exasperating to see people treat individual examples of failure as evidence it "cannot apply concepts properly" rather than as individual datapoints.

People both over- and under-estimate how well it can reason based on the types of problems they put to it, and it's certainly an interesting subject how to gauge an "alien intelligence" like this that is so uneven in areas where we expect a relatively even basis and so have all kinds of heuristics for whether someone "knows".

This is part of the problem: we've all gone through a childhood, and while we've picked up different things, we mostly have a shared floor that is relatively even across a wide range of basic skills. LLMs don't have that, and that messes with people's heads. Those of us who have gone into skilled professions similarly have all kinds of preconceptions about what a junior or senior developer looks like, for example, and LLMs do not fit neatly into those boxes.

They're as dumb as small children in some areas, but still talk confidently about those subjects as if they were educated adults. That is a challenge and a problem. But that doesn't mean they're not able to reason about other subjects. Just not all of them.


Couple of points:

For me that points to reasoning happening by replication of sorts of often poor human output, but not by having a "mechanic" way to reason. As I said, humans are often poor at reasoning.

I also think code creation isn't a good area because it is narrower and more mechanically linked by probability than a lot of other areas (so token probability is potentially more informative). I could be wrong there, though.

What do you even mean by "mechanic" way to reason here?

And what do you expect it'd replicate? As I wrote, I tried looking to see if there were similar pieces of code online, and came up empty. I did that exactly because I was curious about the huge gap in quality between what I'd found before and what GPT4 came up with. Not least because it certainly is not something that happens every time.

> I also think code creation isn't a good area because it is narrower and more mechanically linked by probability than a lot of other areas (so token probability is potentially more informative).

I don't see why that would make it worse. Not least because it also makes it far easier to evaluate the outcome. If anything, we ourselves grasp for formalisms and structure when we want to ensure our reasoning is sound.

Again your use of "mechanically" here also makes absolutely no sense to me.

No, sorry, I view code creation as easier than other things.

I meant it replicates generally poor human reasoning capabilities, but there is no general method to reason something out (because token probabilities are not informative to that end). You can train humans somewhat to that end, but it is not easy.

> No, sorry, I view code creation as easier than other things.

Then we will get nowhere, as it's trivially easy to stump even people of above-average intelligence with problems revolving around reasoning about code.

To me you've then set the bar at a level the vast majority of people can't meet and that's utterly absurd.

And code is just formalised language.

Formalised stuff might favor probabilistic approaches - that was my point.

Anyway, I think "intelligence" and "reasoning" are not always the same to start with.

Why is setting the bar high absurd? It is the same way I demand my pocket calculator to be so much better than humans at calculating things.

Firstly, there's absolutely no evidence whatsoever for that hypothesis. But secondly, this is also poor reasoning. Your argument boils down to implying that if it's good at X, then there might be a bias in its favour with respect to X that makes X a poor test of whether it's reasoning. Making that argument amounts to the logical fallacy of begging the question (assuming the conclusion).

> Why is setting the bar high absurd? It is the same way I demand my pocket calculator to be so much better than humans at calculating things.

This is also poor reasoning. We demand the pocket calculator be better because otherwise we would have no use for it. It would be logically invalid to argue that, if it were merely as capable as a human (maybe one who is not very good but still able to calculate), it is unable to calculate at all.