Hacker News new | ask | show | jobs
by SlinkyOnStairs 98 days ago
> I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo.

I disagree in the case of LLMs.

AI already has a massive problem in reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust it's output".

It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.

And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

> That being said, vastly reducing an LLMs effectiveness as part of an A/B test isn't acceptable which appears to be the case here.

The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.

People use LLMs for things like hiring. An undeclared A-B test there would be ethically horrendous and a legal nightmare for the client.

9 comments

Anyone who trusts LLMs to do anything has shit coming. You can not trust them. If you do, that's on you. I don't care if you want to trust it to manage hiring, you can't. If you do anyway then the ethical problems are squarely on you.

People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs and here I am just using them as a useful tool more powerful than a simple search engine and it's great. No chance it'll replace me, but it sure helps me do ny job better and faster.

Would you have a problem with the following scheme?

Every client is free and encouraged to feed back its financial health: profit for that hour/day/month/...

The AB(-X) test run by the LLM provider uses the correlation of a client's profit with its AB(-X) test, so that participating with the testing improves your profit statistically speaking (sometimes up sometimes down, but on average up).

You may say, what about that hiring decision? One thing is certain: when companies make more profit they are more likely to seek and accept more employees.

That sounds like a good way to get extreme short-term optimization.

Say a particular finetune prioritizes profits right now and makes recommendations like "cut down on maintenance, you can make up for it later with your increased profits and their interest". It produces more profits, and wins the AB test. Later the chickens come home to roost.

You can reduce the problem by using long-term indicators, but then each AB test is very slow.

I think you would be hard pushed to find any big tech company which doesn't do some kind of A B testing. It's pretty much required if you want to build a great product.
Yeah, that's why we didn't have anything anyone could possibly consider as a "great product" until A/B testing existed as a methodology.

Or, you could, you know, try to understand your users without experimenting on them, like countless of others have managed to do before, and still shipped "great products".

I know this is a salty take, but reliance on A/B testing to design products is indicative of product deciders who don't know what they are doing and don't know what their product should be. It's like a chef saying, I want to make a pancake, but trying 50 different combinations of ingredients until one of them ends up being a pancake. If you have to test whether a product works / is good / is profitable, then you didn't know what you were doing in the first place.

Using A/B tests to safely deploy and test bug fixes and change requests? Totally different story.

A responsible company develops an informed user group they can test new changes with and receive direct feedback they can take action on.
A big tech company has ~10k experiments running at once. Some engineers will be kicking off a few experiments every day. Some will be minor things like font sizes or wording of buttons, whilst others will be entirely new features or changes in rules.

Focus groups have their place, but cannot collect nearly the same scale of information.

I think a lot of people (myself included) would just like to not be constantly part of some sort of revenue optimization effort.

I don't care, at all, about the "scale of information" for the company's sake.

Often the experiments are not for revenue - many of them will be optimizing user experience metrics - ie. Load time or user dropoff rate.

They are clearly good for both user satisfaction and the companies bottom line.

As someone who works in these orgs, only a small fraction are about user experience metrics. 90+% are extracting more short term value with unknown second order effects on usability.
Big tech companies are not serving their "users" but advertisers, it's a common mistake.
If you have 10k experiments running then you are probably p-hacking.
A/B testing is the child of profit maximization, engagement farming, and enshittification. Not of "great product building".
Long term effectiveness? LLMs are such a fast moving target. Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that? Next month a new model could be released to everyone else - or by a competitor - that’s a big step difference in performance in tasks you care about. You’d rather be on your own path learning about the state of the world that doesn’t exist anymore? nov-ish 2025 and after, for example, seemed like software engineering changed forever because of improvements in opus.
>Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that?

Where can I sign up?

If you really want to keep non-determinism down, you could try (1) see if you can fix the installed version of the clause code client app (I haven’t looked into the details to prevent auto-updating..because bleeding edge person) and (2) you can pin to a specific model version which you think would have to reduce a/b test exposure to some extent https://support.claude.com/en/articles/11940350-claude-code-...

Edit: how to disable auto updates of the client app https://code.claude.com/docs/en/setup#disable-auto-updates

> Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that?

Yes. I'd like some guarantee that my results are reproducible for some reasonable amount of time. New versions can also introduce regressions. A prompt that works well with today's model might not work with tomorrow's, even if the latter is "better".

> And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So; any sort of research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.

Then do not use LLMs for hiring, or use a specific LLM, or self-host your own!
Isn’t the horrendous ethical and legal decision delegating your hiring process to a black box?
> ethical and legal decision

These are two very different things. I suspect that in some cases pointing finger at a black box instead of actually explaining your decisions can actually shield you from legal liability...

For some proponents, AI is liability washing
Would you rather they change things for everyone at once without testing?
That is not the only other alternative.

You can do A/B testing splitting up your audience in groups, having some audience use A, and others use B - all the time.

I think the article’s author is frustrated over sometimes getting A and at other times B, and not knowing when he is on either.

Strange! You benefitted from all the previous a/b experiments to give you a somewhat optimal model now. But now it’s too inconvenient for you?
Informed consent for a paying user is inconvenient?
Did you read the TOC?
Hiding something like this in the TOC rather than explicitly asking users to opt in is a dark pattern. You can't gain the moral highground by cackling that someone should have read the fine print.
This is technology, some politics, capitalism, and math that is trained on curiously gained data, where does non-selfish morality come in?
All TOS essentially boil down to "we owe you nothing and can change the product at anytime to anything we want at our sole discretion"

Obviously it would be unreasonable to accept such terms without further context. The further context in this case being that Anthropic will maintain Claude as an AI agent and seek to improve it's performance. What is at the heart of this issue is whether or not Anthropics recent A/B testing violated that context. Not whether or not they violated the TOS (they didn't, obviously)

Ultimately that just sounds like within their own TOC, they were just working on getting the best operational results.

If you wanted something more deterministic write it yourself or get it verified, all hosted llms as far as know does neither.

I read the article saying they were testing service changes on paying users without knowledge or explicit consent of the user, that the user had to test and determine why they perceived their sevice changed.

That is a dark pattern to inflict on users expecting consistent output.

Does anyone?
We actually didn't, if much of that A/B testing was to find the optimally "engagement maximizing" i.e. maximally addictive UI design.