HN Mirror

by cobertos 53 days ago

This post just gives me more questions than answers and I'm unable to form a decision:

* Why was v3.4.1 the most buggy, right before the Claude commits? Why did "nobody notice"? It's way too strange to just say welp, it must be human error.
* Why does v3.4.2 have 0 bugs, or a 0 bug score? And why was such an outlier (no other commit seemingly has this??) allowed to mix into aggregate statistics and bring all the "is Claude buggy?" scores down? Tbh idk how that _wasn't_ a red flag in the author's analysis...

This article feels like half of an analysis presented as a highly complex finished product, due to all the advanced stats they're running.

2 comments

logicprog 53 days ago

> Why was v3.4.1 the most buggy, right before the Claude commits? Why did "nobody notice"? It's way too strange to just say welp, it must be human error.

Why wouldn't it be, except for question-begging priors assuming it couldn't be?

> Why does v3.4.2 have 0 bugs, or a 0 bug score? And why was such an outlier (no other commit seemingly has this??) allowed to mix into aggregate statistics and bring all the "is Claude buggy?" scores down?

My original metrics which didn't filter out feature requests and questions had it at four bugs, and prior to that it was even higher. It didn't make much of a difference to the overall analysis (it fell well within the IQR, at the lower end of it too). Also, removing one outlier just because it looks kind of funny to you, especially when we only have two Claude releases at all, would be worse in my opinion, and more arbitrary.
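
A minimal sketch of that kind of check, for anyone who wants to try it on the real data. The numbers below are made up for illustration and are not the article's data, and the function names are hypothetical. It tests whether a release's bug count falls inside the interquartile range (IQR) of a baseline set of releases, and shows how much a two-release average moves when one value changes from 4 to 0.

```python
from statistics import mean, quantiles


def iqr_bounds(values):
    """Return (Q1, Q3) for a list of numbers, using inclusive quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def within_iqr(value, baseline):
    """True if `value` lies between Q1 and Q3 of `baseline`."""
    q1, q3 = iqr_bounds(baseline)
    return q1 <= value <= q3


# Hypothetical per-release bug counts (assumption, NOT the article's data).
baseline_releases = [2, 3, 4, 6, 9, 14]

# The same two hypothetical releases under two counting rules:
# one rule counts 4 bugs for the first release, the other counts 0.
unfiltered_pair = [4, 5]
filtered_pair = [0, 5]

unfiltered_in_iqr = within_iqr(unfiltered_pair[0], baseline_releases)
filtered_in_iqr = within_iqr(filtered_pair[0], baseline_releases)

# With only two data points, one value shifts the mean a lot.
mean_shift = mean(unfiltered_pair) - mean(filtered_pair)

print(iqr_bounds(baseline_releases), unfiltered_in_iqr, filtered_in_iqr, mean_shift)
```

With so few releases, whether a single value sits inside or outside the IQR depends heavily on the counting rule, which is why the raw bug-to-version mapping matters for anyone re-running the numbers.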

cobertos 53 days ago

> Why wouldn't it be, except for question-begging priors assuming it couldn't be?

A multitude of reasons? A change in maintainer. A change in the mental state of a maintainer. A sudden focus by the community on a given undesirable behavior. Someone else here suggested use of Claude AI before it was disclosed. The framing implies that it was a human-produced coding error, but my point is it could be _any other human error_, or even just some odd benign human behavior (a stampede of bug submitters), affecting the data. Which does not lead to the conclusion that AI code > human code. Not looking at these possibilities is so unsatisfying.

> My original metrics which didn't filter out feature requests...

It still feels like a lot of the weight of the phrase "If that doesn't look like a red flag to you, you'd be right." hinges on the fact that one of the versions has 0 bugs, and it really killed the weight of that statement for me, because the oddity of there being 0 bugs just wasn't explained.

---

Could you please post the DuckDB file that has the raw bug -> severity + version mapping to the GitHub repo? I'd like to dig into this myself.

logicprog 53 days ago

I'll do that when I get a chance.

Laurel1234 53 days ago

> Tbh idk how that _wasn't_ a red flag in the author's analysis...

Because he didn't analyze shit, just asked a clanker to rationalize his "clankers are great" conclusion.
