by aesthesia 6 days ago
I don't have a dog in this fight, but a few points that look a little suspicious:

- The release with the highest number of attributed bugs is the release _right before_ the first release with Claude-coauthored commits, released in January; is there a chance that unattributed LLM-authored commits made it into this release?

- The release attribution methodology is not great, since it will tend to attribute bugs introduced in a minor version update to the longest-lived patch release of that minor version. I doubt that 3.4.1 actually introduced a lot of bugs, but since it was released a day after 3.4.0, bugs that were introduced in that release get attributed to 3.4.1.

- Relatedly, more recent releases have had less time to have bugs filed against them, so there may be a bit of a bias toward evaluating recent releases as less buggy.
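To make the third point concrete, here is a minimal sketch of one way to correct for young releases having had less time to accumulate bug reports. It assumes you have a list of historical report delays (days from release to bug report); the delay values and function names below are hypothetical and are not taken from the article's data or scripts.

```python
from bisect import bisect_right


def fraction_discovered(release_age_days, historical_delays_days):
    """Empirical CDF of report delays, evaluated at the release's current age."""
    delays = sorted(historical_delays_days)
    return bisect_right(delays, release_age_days) / len(delays)


def censoring_adjusted_bugs(observed_bugs, release_age_days, historical_delays_days):
    """Scale the observed count by the share of bugs expected to be found so far."""
    fraction = fraction_discovered(release_age_days, historical_delays_days)
    if fraction == 0:
        raise ValueError("release is younger than every historical delay")
    return observed_bugs / fraction


# Hypothetical delays, in days, between a release and each bug report against it.
HYPOTHETICAL_DELAYS = [10, 30, 60, 90, 180, 365, 400, 700]

young_release = censoring_adjusted_bugs(3, 90, HYPOTHETICAL_DELAYS)
old_release = censoring_adjusted_bugs(3, 1000, HYPOTHETICAL_DELAYS)
print(young_release, old_release)
```

With these made-up delays, a 90-day-old release has only seen half of its eventual reports, so the same raw count of 3 is adjusted upward for it and left alone for the old release.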


Agree. From the article:

> Here's my favorite part, though. Digging into the data, one of the first things that jumped out at me with blinding clarity was that the worst release, by far, in rsync history was entirely prior to the introduction of Claude ... And yet nobody noticed.

Language really does suggest the article's author does have a dog in this fight and is cloaking opinion in fancy statistics jargon. "Blinding clarity"? All you have to do is draw a plot. And anyway, v3.4.1 was 2025-01-16, technically well within the AI-assisted coding era and before attribution became standard practice.

Also from the article:

> "Claude clearly made things worse" &emdash; the main claim

This article was clearly generated by AI, yet I found no mention/attribution of that by author.

How likely is it that someone who vibe codes articles would also vibe code the underlying analysis and be eager to accept an outcome that is highly validating of that person’s workflow? I’d say very.

The &emdash; is probably human error; other parts of the HTML correctly use &mdash; or Unicode em-dashes. Also: https://github.com/alexispurslane/rsync-analysis/commit/740b...
He did admit as much:

> "The scripts used to fetch the data, collate it into a DuckDB database file, construct the views on that DB, and then do the statistical analysis on that data, were indeed written by GLM 5.1, as was the HTML and much of the original prose for the final report webpage you're looking at right now."

But: "After posting this on Hacker News and recieving [sic] almost no substantive input, discussion, or response on the actual content of the article, I decided to rewrite all of the prose in my own voice. If anyone complains about my verbosity or sentence structure — as they usually do, which is the reason I originally let the AI write the prose, among other reasons obsoleted by templating — they can go fuck themselves."

So rewritten in his own voice. Maybe the em dashes are from GLM, maybe from the author.

Are the numbers wrong? That's the only relevant thing here.

Also, humans do use em dashes, just FYI.

> Are the numbers wrong? That's the only relevant thing here.

Data without interpretation is irrelevant, and correct numbers can be interpreted wrongly, either on purpose or by mistake.

I’m not saying any of that happened here, only that “are the numbers wrong” is not the only thing that is relevant.

> humans do use em dashes, just FYI.

Your parent comment is not complaining about em dashes, they are pointing out the article has a literal “&emdash;” in it.

Yes, I do for example.

And the author discussed the use of AI pretty exhaustively in point 0 of the post.

You can use LLMs in multiple ways, from very hands on to make local changes to completely hands-off.

I've seen plenty of code that was LLM generated but the commit message itself did not have the co-author attached to it. This only seems to happen when someone's interface to the codebase is completely through Claude/Codex/..., and those are usually the most verbose commits, and yet they say the least, because they just summarize the code changes, not the why.

On the other hand I've seen developers using Claude as a tool. They have VSCode open and a terminal window with Claude and go back and forth, ensuring they write correct code, and leave the plumbing to Claude.

So maybe the author of the code started off small and it grew over time?

I would expect a mature code base like rsync to have a lot of unit tests and integration tests, and frankly, if there aren't enough of them to have caught such bugs, writing them should be your first use of LLMs, so that you set up some deterministic guidelines before you start making changes to your actual code.

I have been experimenting with both aforementioned styles with interesting results.

> I would expect a mature code base like rsync to have a lot of unit tests and integration tests

You might be surprised. C applications which interact heavily with the system - like rsync - can be tricky to test comprehensively, as it's nontrivial to inject faults into system calls. If the application is architected to support this kind of testing, or uses a HAL, that may make matters easier - but an older codebase like rsync probably isn't.

I've had a local LLM spending weeks trying to write tests, then debug those tests, then write antipatterns and patterns for those tests.

It's amusing. It's not terrible, but tests aren't going to save you from a malicious tester.

Your first and second points seem to contradict each other because if all of the bugs for 3.4.1 should be attributed to 3.4.0, that pushes the timetable back even further that unattributed LLM commits would have to have been being committed to the project, which just makes your point even more absurd.

Which brings me to my overall response, which is that there is absolutely no evidence, and nothing even intimating this hypothesis, that LLM commits were secretly being added to earlier releases before they were attributed, and that's why the rate of bugs is higher. There's no reason to think that it's a reasonable thing to think, and there's no evidence for that whatsoever unless you beg the question and assume that higher bug counts must automatically indicate AI involvement, which is just circular reasoning. You're essentially just making up a hypothesis out of thin air to preserve your point.

Regarding your third point, that one's fair, but I've done the analysis and I can put it up if you want, as to how long it usually takes to find bugs and how far through the release cycle we are for each version.

Sorry, I should have said this explicitly in the original comment: I think you're likely _correct_ that there isn't a clear increase in the rate of bugs attributable to LLM-authored code in rsync. Your analysis provides evidence in this direction; these are just the things that made me go "hmm". They're not accusations or claims that the conclusion is invalid. But they're definitely things to be curious about.

Regarding unlabeled LLM-authored commits, I don't think it's unreasonable in general to think that an open-source project might have had unlabeled LLM-authored commits at some point before 2026. Looking more closely at rsync's recent commit history, I think it's less likely in this case. There's just a low number of commits in general, _until_ large batches of Claude-authored commits start showing up early this year. But this then raises some questions about the bugs-per-commit metric; it does correct for something like "size of release", but also obscures a significant shift in commit velocity that may be downstream of adding LLM development tools to the workflow.

Like I said, I don't have a dog in this fight, and I try not to approach these sorts of questions from a position of explicit advocacy. I do think it's an interesting question, though, and we should try to understand what the data is actually telling us.

Isn't the metric that you've used "bugs per commit ~ per new line of code" going to miss the issue?

All code is technical debt.

If rsync releases used to have 500 lines changed and 5 bugs in and AI-powered rsync releases have 50000 lines and 500 bugs, it's the same bugs/line but much worse experience for the user?

I've not looked into the details of this case and I do use AI assistance coding at work but in my experience, the problem is that it's too easy to write lots of code and therefore hard to review the huge volumes of code and this analysis will ignore that?

edit: actually your table shows there weren't unusually large numbers of commits in this release, so perhaps my initial skepticism shows a bias I have?
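The arithmetic in this comment can be spelled out in a few lines. This is only a sketch using the comment's own hypothetical figures (500 lines with 5 bugs versus 50000 lines with 500 bugs); none of these numbers come from rsync.

```python
def bug_rate(bugs, lines_changed):
    """Bugs per changed line: the size-normalized metric."""
    return bugs / lines_changed


# The comment's two hypothetical releases: (bugs, lines changed).
small_release = (5, 500)
large_release = (500, 50_000)

same_rate = bug_rate(*small_release) == bug_rate(*large_release)
extra_bugs_for_users = large_release[0] - small_release[0]

print(same_rate, extra_bugs_for_users)
```

The normalized rate is identical, while the absolute number of bugs a user can run into differs by 495, which is exactly the gap a per-commit or per-line metric hides.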

OpenBSD used to have sqlite in base, but the code churn rate was too high to review. This was well before the recent LLM craze, so a human (perhaps not a normal one, though) already suffices to generate too many changes for others to check for errors.
I started to look into the same thing considering releases are quite infrequent. To avoid the issue of unattributed LLM-authored commits, in my opinion the analysis should include a comparison to bug severity before and after release v3.3.0 (date April 6th, 2024)
Let's start with the most outright alarming error: the Claude statistics are drawn from a whole 2 data points.
That's sort of the point. There isn't enough data to extrapolate, and yet that's exactly what those outraged about AI were doing, and when you do do the very minimal types of analyses (permutation tests, and looking at distributions, mostly) that are actually valid, safe, standard, and useful to do on such low amounts of data, again, no evidence for the outrage shows up, and the two releases look so normal that it sort of shows no one would've cared if they hadn't known or found out that Claude was involved.

I really think this is a much better standard of evidence, limited though it is, than outrage-fueled cherry-picked anecdotes, which is what has been driving this whole thing. If you disagree, and think the outrage should go on when I've shown there's an absence of evidence entirely for it (although of course, that's not evidence of absence; maybe I'll have to eat my words 5 releases down the line, but appealing to that now feels like a Russell's Teapot), would you care to explain why?
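For readers unfamiliar with the permutation tests mentioned above, here is a minimal sketch of an exact one-sided version. It is not the article's script: the function name and the toy values are hypothetical, and the test statistic (sum of the labeled group's values) is an assumption chosen for simplicity.

```python
from itertools import combinations


def exact_permutation_p(values, labeled_indices):
    """One-sided exact p-value: share of relabelings at least as extreme as observed."""
    group_size = len(labeled_indices)
    observed = sum(values[i] for i in labeled_indices)
    relabelings = list(combinations(range(len(values)), group_size))
    as_extreme = sum(
        1 for group in relabelings if sum(values[i] for i in group) >= observed
    )
    return as_extreme / len(relabelings)


# Hypothetical bug scores for six releases; the last two are the labeled ones
# and are deliberately the two worst, i.e. the most extreme possible outcome.
toy_scores = [1, 2, 3, 4, 5, 6]
p_value = exact_permutation_p(toy_scores, (4, 5))
print(p_value)
```

With two labeled releases out of six there are only 15 possible relabelings, so even the most extreme outcome gives p = 1/15, which is above 0.05. How far this applies to the real data depends on how many releases are in the comparison set.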

I know you’re defending your work here but this behavior does absolutely nothing to help your point.
Fair point. Let me edit (if I still can) to tone it down.
You could've literally just waited a few more releases, but no, you have to catch the hype wave before the news goes cold.

> that are actually valid, safe, standard, and useful to do on such low amounts of data,

If you presented a paper with that number of data points you'd be laughed out of the room.

The interpretation of the p-value is also alarming. One of the first things they teach you in statistics class is: “an absence of evidence is not evidence of absence”.

This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.

Traditional p-hacking is done by oversampling and overtesting. If you do 20 analyses, on average one will show p < 0.05 by random chance. This analysis is doing the inverse of that. Under-sampling, and concluding with p > 0.05.
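The "20 analyses" figure can be checked directly. A minimal sketch, assuming the 20 tests are independent and every null hypothesis is true (both are simplifying assumptions):

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


expected_false_positives = 20 * 0.05
chance_of_at_least_one = familywise_error_rate(20)

print(expected_false_positives, round(chance_of_at_least_one, 4))
```

So the expected number of spurious hits is one, and the chance of getting at least one is roughly 64%.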

> This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.

I tried pretty hard to avoid saying that, can you point me at how to rephrase? The point I'm trying to make is just that there is absolutely no evidence at all for what people are saying with such absolutism and claimed objectivity (that Claude made rsync worse), and thus it doesn't justify the outrage.

> Under-sampling, and concluding with p > 0.05

How would I avoid under-sampling here? And if you're going to say it's because I only have 2 data points, well, the side making the positive claim — that Claude made rsync worse — only had two as well, and unremarkable ones at that, as I've tried very hard to show.

You are interpreting the p-values on their own merit rather than using them to test a null hypothesis. Quotes like:

> With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06 — essentially 1:1. Claude releases are no more likely to be above the median than any other releases.

are problematic in this context, as the correct conclusion here is that you just don’t have enough data to conclude whether or not you are more likely to encounter a bug after a Claude commit.

> How would I avoid under-sampling here?

You don’t. You admit that you don’t have enough data and move on. What you are trying to do here is prove a negative, which is extremely hard to do. In your discussion you claim that the users complaining had no right to; however, nothing in your analysis showed they were wrong. We simply don’t have enough data (yet) to say either way. When we have enough data they may be proven right or wrong, but until then, we cannot conclude either way.

If you still insist, I recommend looking into Bayesian analysis. Theoretically at least, the posterior distribution from a Bayesian analysis can be interpreted directly and analyzed on its own merits. However, I suspect your posterior will have way too much uncertainty to reach any conclusions.
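A rough illustration of how wide such a posterior would be, as a sketch only. It assumes a deliberately simple model that is not from the article: each release is either above or below the historical median bug rate, the prior on the "above" probability is uniform, and so the posterior is a Beta distribution with integer parameters. The counts used are hypothetical.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for integer a, b, via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(above_median, releases, level=0.95):
    """Equal-tailed interval for P(above median) under a uniform Beta(1, 1) prior."""
    a, b = above_median + 1, releases - above_median + 1
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)


# Hypothetical: 1 of 2 releases above the median, versus 25 of 50.
two_low, two_high = credible_interval(1, 2)
fifty_low, fifty_high = credible_interval(25, 50)
print(round(two_low, 3), round(two_high, 3))
```

With two releases the 95% interval runs from roughly 0.09 to 0.91, which is compatible with almost any underlying rate; the 50-release interval is much narrower.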

Edited that claim, and made several clarifications elsewhere. The whole point of this analysis is that outrage is unjustified on the basis of two totally statistically unremarkable releases that no one would have remarked on pre-AI (my further proof of this is that there was a pre-AI remarkably broken release, and no one did comment!) and zero positive evidence outside cherry-picked anecdotes for any negative impact. We should wait for outrage and version pinning and cancelation until there is evidence, no? I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently; I'm not trying to build any kind of predictive model for future Claude releases to say anything grander than "these specific releases are fine, what are we freaking out about?", not some claim about what Claude-exposed releases will look like or trend like in the future or in general.
There is a lot more context to the outrage which is missing from your analysis. People have multiple reasons to be mad at AI usage, you mention some of them in your introduction, and you put a (statistically insignificant) measure on only one of them. In your analysis you have shown that exactly one of these reasons is anecdotal. That does not mean they are wrong, and it especially does not mean they are unjustified.

That you found a single pre-AI release which did not cause outrage is proof of nothing. This single release is equally anecdotal, and statistically insignificant.

So, the biggest context that is missing here is that people hate AI for various reasons, and they don’t want their favorite tools to fall victim to AI for equally many reasons. It is only natural that people who hate AI react this way when they find out their favorite tool uses AI, and doubly so when they sniff correlation between their favorite tool's use of AI and bugs.

> I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently.

Well, there is no evidence against harm either. But what you did here is a bit of a sleight of hand. In your analysis your null hypothesis is: “There is no difference in bug count between releases which include code commits from Claude Code and releases which don’t”. (You then go about doing what every psychology major is taught not to do: find evidence for the null hypothesis, not against it.) However, what hypothesis testing is for is to use a representative sample to generalize over a wider population. You do hypothesis testing because you want to demonstrate that your sample is representative of a wider population, and not that you just so happened to pick, by random chance, the two samples which show the effect regardless of the experiment.

By calculating the p-values you were telling me that you were in fact ready to make generalizing statements over a wider population of commits, but your results were statistically insignificant, so really you should not draw any conclusions from them. You have not, in fact, shown that they aren’t different from the rest of the population.

The concept you need here is "Statistical Power".

The ELI5 version is that there are two mistakes you can make when looking at a P value:

Type I error, where your P value is falsely low. In the experiment being discussed here, it would lead one to conclude that AI code is worse. Otherwise known as a false positive.

Type II error, where your P value is falsely high, leading you to conclude that AI code is no different. Otherwise known as a false negative.

https://en.wikipedia.org/wiki/Power_(statistics)

One can calculate statistical power for a given experimental protocol.

My hunch is that if you did this, you would find this experiment is grossly under-powered.

This means you can't make the "absence of evidence" claim.
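The power calculation can be sketched exactly for a simplified version of the "above the median" comparison. This is an assumption-laden stand-in, not the article's test: each labeled release is treated as an independent coin flip that lands above the median with probability 0.5 under the null, and the alternative probabilities passed in are hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def power_one_sided(n, p_alt, p_null=0.5, alpha=0.05):
    """Power of the exact one-sided binomial test of p = p_null against p > p_null."""
    for k in range(n + 1):
        # The smallest k whose null tail probability is within alpha is the
        # rejection threshold; power is the same tail under the alternative.
        if binomial_upper_tail(n, k, p_null) <= alpha:
            return binomial_upper_tail(n, k, p_alt)
    return 0.0  # no outcome is extreme enough to reject at this alpha


power_with_two = power_one_sided(2, p_alt=0.9)
power_with_five = power_one_sided(5, p_alt=0.8)
print(power_with_two, round(power_with_five, 5))
```

With n = 2 the most extreme outcome (both releases above the median) has a null probability of 0.25, so the test can never reject at 0.05 and its power is zero no matter how large the true effect is.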

He can't make the evidence of absence claim, but he can absolutely make the absence of evidence claim.
Perhaps in an “everyday language” way, but not in the technical, statistical sense.

In an underpowered statistical study, a claim that two experimental conditions did not differ is not persuasive.

No. It's a description of the result of the maybe-underpowered study. The underpowered study did not find evidence. Evidence is absent. Because it is underpowered, it's not evidence that the effect is absent.

The claim is not "two experimental conditions did not differ". The claim is "The data do not show evidence that the experimental conditions did differ".

If one asks "Is the house on 123 Road Street, NJ, taller than the statistical average", then there is only 1 data point for the house on 123 Road Street, NJ, which is also 100% of the houses on 123 Road Street, NJ.
You can apply that to the outrage too: the people pissed off about this are going off 2 measly data points.