| HN Mirror

by anematode 2 days ago

Not impressed so far, to be honest. I'm having it try to optimize Stockfish in a loop (on xhigh mode) with a benchmarking oracle; even after giving it specific hints ("consider whether we're prefetching Y optimally, can we make function X branchless"), it has so far been unable to recover any of the recent optimizations we've implemented, let alone novel ones. Opus 4.8 felt a bit more creative to me... but it's a small sample size so far. Next I'm going to try it on some less open-ended problems.

Edit: It did correctly identify that transparent huge pages were off in its sandboxed environment and that enabling them was helpful, so that's nice. It also noticed that we skip THP on a certain less-used path.
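
For anyone who wants to check the same thing in their own sandbox, here is a minimal sketch, assuming a Linux system that exposes the usual sysfs file for the system-wide THP setting. The helper names `parse_thp_mode` and `current_thp_mode` are made up for illustration and are not part of Stockfish or any tool mentioned here.

```python
import re
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for the system-wide THP setting.
# Assumption: the sandbox mounts sysfs and allows reading this file.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the active mode from text such as 'always [madvise] never'.

    The kernel marks the active choice with square brackets.
    """
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

A result of `never` would match the "THP was off" situation described above; `madvise` means huge pages are only used for regions the program explicitly requests them for.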

More importantly, I'm finding that the code it produces for its experiments is a lot cleaner than what I'd expect out of Opus; there are fewer useless comments and it's more surgical and readable. I wonder if that explains the increased scores on benchmarks measuring mergeability.

2 comments

wgd 2 days ago

Stockfish is a machine learning system, so it seems quite plausible you might be getting slapped with the silent performance degradation (https://news.ycombinator.com/item?id=48467896).

redox99 2 days ago

Them silently nerfing the model without telling you, and still fully charging for it, is a new low and should probably be illegal.

NoahZuniga 1 day ago

Well, they're not fully charging you. You get Opus 4.8 pricing when it falls back to Opus 4.8. Also, you can disable it (and it seems like it's off by default in the API).

LiamPowell 1 day ago

They don't fall back to Opus if their classifier thinks you might be working on anything that might be a competitor's product. It silently injects instructions into the prompt to sabotage your work. Read the policy above; it's insane to me that they're publicly admitting to this.

xiphias2 1 day ago

Not for machine learning, just for security bug finding and biology

taurath 2 days ago

Doesn't this "silent degradation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.

lionkor 1 day ago

Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.

I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.

janalsncm 2 days ago

I don’t think so? They can claim it was an act of God for all I care, but at the end of the day the model failed the task.

anematode 2 days ago

Yup, I suspect that's what's going on

dakolli 2 days ago

I suspect it just sucks; these models aren't useful. Stop lying to yourself.

komali2 2 days ago

No, since it's a silent failure, it's not plausible. We have to assume all results we get are the actual model performance, because that is the actual model performance as we understand it.

Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.

janalsncm 2 days ago

It’s possible this is happening at a technical level, but I have a hard time believing this is in the spirit of what Anthropic intends to throttle. It isn’t chip design or building out a competitor to Claude.

Stockfish does use neural nets, but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.
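
To put numbers on that ratio, here is a quick back-of-the-envelope sketch. The 10M figure and the two multipliers are the rough estimates from this comment, not measured values, and `implied_params` is just an illustrative helper name.

```python
# Rough estimate from the comment above, not an official figure.
STOCKFISH_PARAMS = 10_000_000


def implied_params(base: int, factor: int) -> int:
    """Parameter count implied by scaling `base` by `factor`."""
    return base * factor


if __name__ == "__main__":
    for factor in (100_000, 1_000_000):
        total = implied_params(STOCKFISH_PARAMS, factor)
        print(f"{factor:,}x larger -> {total:,} params")
```

So the guess works out to somewhere between 1 trillion and 10 trillion parameters for a frontier LLM.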

wgd 2 days ago

Yeah, I agree this is probably outside the intended scope of the silent sabotage mechanism, but there are plenty of reports of the "loud" safety classifier misfiring on innocuous requests, and I'm not going to assume the silent failure mode is _less_ prone to false positives.

anematode 1 day ago

Edit: Another developer seems to have found a legitimate speedup with Fable in an optimization loop. It's a nice idea, actually, and I'm duly impressed.
