| HN Mirror

Comments trashing this are from rightly skeptical people who remember the benchmaxxing of Llama 4. This model was out in the woods as early as like a couple months ago but they didn't release it because it was at gemini 2.5 pro levels.

> This model was out in the woods as early as like a couple months ago but they didn't release it because it was at gemini 2.5 pro levels.

Source? (Even if rumor)

https://www.nytimes.com/2026/03/12/technology/meta-avocado-a...

nl 63 days ago

NYTimes had a story about this (March 12):

> Meta’s new foundational A.I. model, which the company has been working on for months, has fallen short of the performance of leading A.I. models from rivals like Google, OpenAI and Anthropic on internal tests for reasoning, coding and writing, said the people, who were not authorized to speak publicly about confidential matters.

> The model, code-named Avocado, outperformed Meta’s previous A.I. model and did better than Google’s Gemini 2.5 model from March, two of the people said. But it has not performed as strongly as Gemini 3.0 from November, they said.

> They added that the leaders of Meta’s A.I. division had instead discussed temporarily licensing Gemini to power the company’s A.I. products, though no decisions have been reached.

https://archive.is/uUV5h#selection-715.98-715.277

It was from a Techmeme Ride Home podcast episode where the host cited "sources at the company". I don't remember which day's episode it was.

zozbot234 64 days ago

The llama4 series was one of the earliest large MoEs to be made publicly available. People just ignored it because they were focused on running smaller and denser models at the time; we should know better these days.

dilap 64 days ago

Deepseek R1 was a publicly available MoE model that was getting a ton of attention before llama4. Llama4 didn't get much attention because it wasn't good.

jychang 64 days ago

Also, Gemini 2.5 Pro launched a week before Llama 4.

It was Gemini 2.5 Pro that redeemed Google in the eyes of most people as a valid competitor to OpenAI instead of as a joke, so Meta dropping the ball with Llama 4 was extra bad.

the models were objectively horrible

NitpickLawyer 64 days ago

They really weren't horrible. They were ~gpt4o, with the added benefit that you could run them on premise. Just "regular" models, non "thinking". Inefficient architecture (number of active out of total) but otherwise "decent" models. They got trashed online by bots and Chinese shills (I was online that weekend when it happened, it's something to behold). Just because they were non-thinking when thinking was clearly the future doesn't make them horrible. Not SotA by any means, but still.
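To make the "active out of total" point concrete, here is a minimal sketch of the ratio. The 17B active figure and the model names come from this thread (17b-16e, 17b-128e); the total parameter counts (roughly 109B for Scout and 400B for Maverick) are an assumption taken from Meta's published model cards, not from anything said here. The class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    active_b: float  # parameters used per token, in billions
    total_b: float   # parameters held in memory, in billions (assumed, see above)

    @property
    def active_fraction(self) -> float:
        """Share of the total weights that are active for any one token."""
        return self.active_b / self.total_b


models = [
    MoEModel("llama-4-scout-17b-16e", 17, 109),
    MoEModel("llama-4-maverick-17b-128e", 17, 400),
]

for m in models:
    print(f"{m.name}: {m.active_fraction:.1%} of weights active per token")
```

A low fraction means you pay memory for the full model while only a small slice does work on each token, which is presumably what "inefficient" refers to here.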

nl 64 days ago

> They were ~gpt4o, with the added benefit that you could run them on premise.

No, they are bad models. They were benchmaxxed on LMArena and a few other benchmarks, but as soon as you try them yourself they fall to pieces.

I have my own agentic benchmark[1] I use to compare models.

Llama-4-scout-17b-16e scores 14/25, while llama-4-maverick-17b-128e scores 12/25.

By comparison gemma-4-E4B-it-GGUF:Q4_K_M scores 15/25 (that is a 4B parameter model!) - even GPT3.5 scores 13/25 (with some adjustment because it doesn't do tool calling).

Llama 4 was a bad model, unfortunately.

[1] https://sql-benchmark.nicklothian.com/#all-data
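For a rough sense of scale, the scores quoted above can be put side by side as pass rates. This is only a minimal sketch over the four numbers in this comment; the dictionary and function names are made up for illustration, and it says nothing about how the benchmark itself is run.

```python
# Scores out of 25 as quoted in the comment above; nothing else is assumed.
TOTAL_TASKS = 25

scores = {
    "llama-4-scout-17b-16e": 14,
    "llama-4-maverick-17b-128e": 12,
    "gemma-4-E4B-it-GGUF:Q4_K_M": 15,
    "GPT3.5": 13,
}


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the share of tasks passed, as a percentage."""
    return 100.0 * passed / total


def ranked(results: dict[str, int]) -> list[tuple[str, float]]:
    """Sort models by pass rate, best first."""
    return sorted(
        ((name, pass_rate(n)) for name, n in results.items()),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    for name, pct in ranked(scores):
        print(f"{name:30s} {pct:5.1f}%")
```

With only 25 tasks, a single task is worth 4 percentage points, so gaps of one or two tasks should be read loosely.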

ac29 63 days ago

> By comparison gemma-4-E4B-it-GGUF:Q4_K_M scores 15/25 (that is a 4B parameter model!)

Gemma 4 E4B is slightly confusingly named; it's an 8B param model.

Wrote longer comment steel-manning this, posted it to a reply, then realized you might like to know they had a reasoning model on deck ready for release in the next 2-4 weeks.

Got shitcanned due to bad PR & Zuck God-King terraforming the org, so there'd be a year delay to next release.

Real tragi-comedy, and you have no idea how happy it makes me to see someone in the wild saying this. It sounds so bizarre to people given the conventional wisdom, but, it's what happened.

Nah, I remember how disgusted I felt trying llama 4 maverick and scout. They were both DOA; they couldn't even beat much smaller local models.

I'll cosign what you said. At the same time, your interlocutor's point is also well-founded, and it depresses me that it's not better known and sounds so... off... due to conventional wisdom plus God-King Zuck misunderstanding his own company and the resulting overreaction.

They beat Gemini 2.5 Flash and Pro handily on my benchmark suite. (tl;dr: tool calling and agentic coding).

Llama 4 on Groq was ~GPT 4.1 on the benchmark at ~50% the cost.

They shouldn't have released it on a Saturday.

They should have spent a month with it in private prerelease, working with providers.[1]

The rushed launch and ensuing quality issues got rolled into the hypebeast narrative of "DeepSeek will take over the world".

I bet it was super fucking annoying to talk to due to LMArena maxxing.

[1] My understanding is the longest heads-up was single-digit days, if any. Most modellers have arrived at 2+ weeks now; there's a lot between spitting out logits and parsing and delivering a response.

pixel_popping 64 days ago

failing non-stop at tool calls on top of that.

owebmaster 63 days ago

Thanks for calling me a bot. Llama4 and Meta AI suck.

canes123456 64 days ago

Why go into coding agents? Both Anthropic and OpenAI are going all in on that. The opportunity is customer-facing AI now.

OpenAI has the mindshare, but they're going to have to decide whether to allocate their limited compute to free users or go all in trying to keep up with Anthropic in enterprise.

kaycey2022 64 days ago

you can do way more than just coding with the coding agents.

foobiekr 64 days ago

Because coding agents are where the revenue is.

If you squint at coding agents you see the next OS.

Maybe better phrasing is “HCI paradigm”, but that somehow manages to say everything and nothing.

whattheheckheck 64 days ago

Programming was always about designing Rube Goldberg systems that ran a complicated state machine, akin to dominoes. But now we have a probabilistic and nondeterministic domino that has a huge number of dominoes inside and can dynamically generate many different paths of dominoes, sometimes not even leading to the final domino you wanted to fall.

I see it more like a compiler

RealStupidity 63 days ago

I agree that it's more like a compiler (turns higher level language into machine code) but I also think that's only half the story - a compiler could never turn requirements into functional software, generate boilerplate or debug. It's also a development tool.

modeless 64 days ago

It's a decent model if the benchmarks are to be believed, but it won't be close to Opus in usefulness for programming. None of these benchmarks completely capture what makes a model useful for day-to-day coding tasks, unfortunately. It will take time for them to catch up, and Opus will keep improving in the meantime. But it's good to have more competition.

ai5iq 64 days ago

Benchmarks miss the thing that actually matters for agentic use: how does behavior change over a multi-day horizon? A model that scores well on one-shot coding tasks can still make terrible decisions when it has persistent state and resource constraints. That's where you see the real gaps between models.

andai 63 days ago

Is there a benchmark for these long tasks? That kind of seems like the only number worth measuring.

(Of course at that point it involves memory and context management and so on, so you're testing the harness as well as the model.)

redox99 64 days ago

> If it slightly beats or even matches Opus 4.6

It doesn't though

ryeguy_24 64 days ago

Curious on why you think this. Any data points that led you to this?

howdareme 64 days ago

The benchmarks they released

johnfn 64 days ago

What do you mean? In most cases, the benchmarks show a larger number for Muse and a smaller number for Opus.

spprashant 64 days ago

In Multimodal yes, but Opus is definitely edging out in Text/Reasoning and Agentic benchmarks.

I think the general skepticism is because they are late to the race, and they are releasing an Opus-4.6-equivalent model now, when Anthropic is teasing Mythos.

ChipopLeMoral 64 days ago

> I don't get the comments trashing this.

People like to hate on Meta regardless of anything, and regardless of whether it's justified or not. Not saying it isn't, just that it's many people's default bias.

jatora 64 days ago

That is not the case here. Nobody hated on Llama 1, 2, 3 at all. They justifiably felt burned by the benchmaxxing of Llama 4. Trust broken must be re-earned, and benchmarks alone cannot do that.

blazespin 64 days ago

Because bots and trillion-dollar IPOs and even bigger stakes. People need to better appreciate the level of manipulation going on. Social media has an outsized impact. Bots and even people are getting paid to post and upvote/downvote narratives.

asdfman123 64 days ago

> people are getting paid to post and upvote/downvote narratives

This problem will be solved shortly with better AI (if it hasn't essentially been solved already).

No more humans in the loop, much lower costs for social media manipulation. Welcome to the future!