| > I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo. I disagree in the case of LLMs. AI already has a massive problem in reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output". It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around. And this also pretty much ruins any attempt to research Claude Code's long-term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test. > That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here. The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review. People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client. |
People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs and here I am just using them as a useful tool more powerful than a simple search engine and it's great. No chance it'll replace me, but it sure helps me do my job better and faster.