| HN Mirror


by hedora 51 days ago

They implemented both those things, but only apologized for the first. They’re doubling down on the second.

My limited experience with fable over the last few days suggests (1) I can’t see any improvement in output, and (2) it is useless for writing secure software because it constantly hits safety walls if you ask it to close security holes.

I’m definitely shopping around for other LLM providers next week, and testing vs local (target: 128GB strix halo - any war stories?)

2 comments

coreyp_1 51 days ago

With a 128 GB strix halo, you can't run as big a model as you would think. You can go larger than you could with a single graphics card, of course, but that 128 gigs cannot all be dedicated to the model. Remember, the context alone is usually larger than the model itself. I got an EVO X2, and I don't regret it, but by my current calculations, it will take 8 years to recoup the cost compared with just paying for equivalent commercial options.
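For anyone wanting to redo that payback math with their own numbers, here is a minimal sketch. The function name and every dollar figure are hypothetical placeholders, not the figures behind the 8-year estimate above.

```python
def breakeven_years(hardware_cost, monthly_api_spend, monthly_power_cost=0.0):
    """Years until a local box pays for itself versus a paid API.

    All inputs are in the same currency. Ignores resale value,
    price changes, and the value of your own time.
    """
    monthly_savings = monthly_api_spend - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")  # local never pays off at these numbers
    return hardware_cost / monthly_savings / 12


# Hypothetical inputs: $2,000 machine, $25/month API spend, $4/month electricity.
print(round(breakeven_years(2000, 25, 4), 1))  # 7.9
```

The result is very sensitive to the monthly API spend, so heavy users will see a much shorter payback than light ones.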


smilekzs 51 days ago

A key consideration in favor of running your local LLM despite all the trouble: The commercial serving endpoint may not exist tomorrow, or at least not at the same price.


hedora 51 days ago

My current rule of thumb is 1GB gets you 1B parameters with a big context. (Qwen 32B fits in 32GB with 200K+ contexts)

That’s with heavy compression of the weights and the context, of course.

I haven’t gone through model evaluation + shoehorning at 128GiB yet.
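A rough way to sanity-check that rule of thumb is to add up quantized weights plus the KV cache. This is a back-of-the-envelope sketch: the layer count, KV head count, and head dimension below are assumed values for a 32B-class model with grouped-query attention, not numbers checked against any model card, and it ignores activations and runtime overhead.

```python
def weights_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Memory for keys and values across all layers at a given context length."""
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8 / 1e9


# Assumed shape: 64 layers, 8 KV heads, head_dim 128 (hypothetical).
w = weights_gb(32e9, bits_per_weight=4)
kv = kv_cache_gb(64, 8, 128, context_len=200_000, bits_per_value=4)
print(f"weights {w:.1f} GB + KV cache {kv:.1f} GB = {w + kv:.1f} GB")
```

Under those assumptions, 4-bit weights and a 4-bit KV cache at 200K tokens come to roughly 29 GB, which is in line with the 1 GB per 1B parameters figure. With an 8-bit KV cache the same setup would no longer fit in 32 GB.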


keeganpoppen 50 days ago

the output is definitely better. and i find it crazy how every time a new model comes out people trip over themselves to say how much worse it is than previous models, when in fact that is basically an impossibility. like, they've got the numbers, man-- you only release a new model when the numbers get gooder. the burden of proof is on the "didn't get better" side, not the "prove that it's better" side, because the architecture itself (1) only works because of how giant the training data / eval / etc. sets are and (2) has a fractal property of becoming strictly deeper and more thoughtful when you just click and drag the edge up and to the right (obviously AI research is harder than this, but that doesn't make the general point untrue). i say this especially because the scuttlebutt is that this model genuinely is a shift-click-expand more so than any sort of architectural "new science" or anything.

this is exactly why hypotheses come before the experiment in the scientific method.


suttontom 50 days ago

You're wrong in lots of ways.

Some model cards do show regressions on benchmarks for newer models on specific tasks: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

This wasn't a new model, but it shows that an update backed by better numbers can still make the model worse: https://openai.com/index/sycophancy-in-gpt-4o/

The slight increases in performance/benchmarks may be just noise: https://arxiv.org/pdf/2602.07150
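To make the noise point concrete, here is a minimal sketch of a two-proportion z-test on benchmark accuracy. The scores and question count are hypothetical, this is not the method of the linked paper, and it treats the two runs as independent samples, which is a simplification when both models answer the same questions.

```python
import math


def accuracy_diff_z(acc_old, acc_new, n_questions):
    """Return (standard error, z-score) for the difference of two accuracies.

    Each accuracy is treated as a binomial proportion over n_questions.
    """
    se = math.sqrt(
        acc_old * (1 - acc_old) / n_questions
        + acc_new * (1 - acc_new) / n_questions
    )
    return se, (acc_new - acc_old) / se


# Hypothetical: old model 70%, new model 72%, on a 500-question benchmark.
se, z = accuracy_diff_z(0.70, 0.72, 500)
print(f"standard error {se:.4f}, z = {z:.2f}")
```

With these made-up numbers z comes out around 0.7, well under the usual 1.96 threshold, so a 2-point gain on 500 questions is not distinguishable from chance on its own.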
