by bradhilton 551 days ago
Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
4 comments

If you look at where talent is going, it's Anthropic that is the real competitor to Google, not OpenAI.
Are these benchmarks still meaningful?
No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays if a model would be SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.

I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.

Meanwhile, Google's newest 2.0 Flash model went 0 for 7.

1: https://metro.co.uk/2024/12/11/gchq-christmas-puzzle-2024-re...

Why are you comparing flash vs o1-pro, wouldn't a more fair comparison be flash vs mini?
I just asked o1-mini the first two questions and it got them wrong.
It's the only Google model that my account has access to that accepts .PNG files. I assumed it was the latest/greatest experimental 2.0 release.

If they want a rematch, they'll need to bring their 'A' game next time, because o1-pro is crazy good.

Did it get the 8 right? The linked article provides the wrong answer btw.
I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.

Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.

You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?

if you follow the sleigh morse path starting from the robin it will be 'united in protecting the kingdom'.
Wow! That’s all I need to know about Google’s model.
What is impressive about this new model is that it is the lightweight version (flash).

There will probably be a 2.0 pro (which will be 4o/sonnet class) and maybe an ultra (o1(?)/Opus).

That's a comparison of multiple GPT-4 models working together... against a single GPT-4 mini style model.
> multiple GPT-4 models working together

What do you mean? Is o1 not a single model?

Regarding TPUs, sure, for the stuff that’s running in the cloud.

However, their on-device TPUs lag behind the competition, and Google still seems to struggle to move significant parts of Gemini to run on device as a result.

Of course, Gemini is provided as a subscription service as well so perhaps they’re not incentivized to move things locally.

I am curious if they’ll introduce something like Apple’s private cloud compute.

i don’t think they need to win the on device market.

we need to separate inference and training - the real winners are those who have the training compute. you can always have other companies help with inference

> i don’t think they need to win the on device market.

The second Apple comes out with strong on-device AI - and it very much looks like they will - Google will have to respond on Android. They can't just sit and pray that e.g. Samsung makes a competitive chip for this purpose.

I think Apple is uniquely disadvantaged in the AI race to a point people don't realize. They have less training data to use, having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on. They have little to no cloud business, and while they operate a couple of services for their users, they do not have the infrastructure scale to compete with hyperscaler cloud vendors such as Google and Microsoft. Most of what they would need to spend on training new models would require that they hand over lots of money to the very companies that already have their own models, supercharging their competition.

While there is a chance that Apple might come out with a very sophisticated on-device model, the problem here is that they would only be able to compete with other on-device models. The magnitude of compute needed to keep pace with SOTA models is not achievable on a single device. It will take many generations of Apple silicon in order to compete with the compute of existing datacenters.

Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.

Apple is a very distant competitor in the space of AI, and I see no reason to assume this will change, they are uniquely disadvantaged by several of the choices they made on their way to mobile supremacy. The only thing they currently have going for them is the development of their own ARM silicon which may give them the ability to compete with Google's TPU chips, but there is far more needed to be competitive here than the ability to avoid the Nvidia tax.

There’s an easy solution here: Apple isn’t trying to compete with the big models everyone else is running. They’re betting in the opposite direction, that many small models are a better value add for their customers. And they can call out to other services as needed for the larger stuff.

I’m in the camp that this is the right call for consumers, instead of trying to compete on the large model side. They’ve yet to deliver on their full promise, but if they can, it’s the place where I think more of the industry will go (for consumers)

And regarding Google’s mobile tensor chips, they are infamously behind all other players in the market space for the same generation of processor. They don’t share the same advantages they do in the server space.

training bigger models gets you small models for free plus a higher upper bound in capabilities.

Apple just isn’t very capable in this space, not sure what’s so hard to accept

Apple have trained their own foundation LLM.
"having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on"

That may not be as big a disadvantage as you think.

Anthropic claim that they did not use any data from their users when they trained Claude 3.5 Sonnet.

sure but they certainly acquired data from mass scraping (including of data produced by their users) and/or data brokering aka paying someone to do the same.
It is likely Apple can get additional data by creating synthetic data for user interactions.

About 7 years ago I trained GAN models to generate synthetic data, and it worked so well. The state of the art has increased a lot in 7 years, so Apple will be fine.

For a while there I would have been in agreement with you, but the thought that models can be trained purely on synthetic data has been shown to be wrong on multiple levels. Synthetic data needs to be reviewed by individuals to ensure data quality, significantly reducing the speed at which an organization can adopt training data. Reasonable engineers would suggest that the answer to this is to have other language models review the synthetic data, but we have seen that this is what leads to model collapse due to compounding issues around hallucinations.

At best, synthetic data is a "slow follow" for training a model due to the need for human review, but a competitive model, it does not make.

yeah i’ve never understood the outsized optimism for apple’s ai strategy, especially on hn.

they’re a little bit less of a nobody than they used to be, but they’re basically a nobody when it comes to frontier research/scaling. and the best model matters way more than on-device which can always just be distilled later and find some random startup/chipco to do inference

Theory: Apple's lifestyle branding is quite important to the identity of many in the community here. I mean, look at the buy-in at launch for Apple Vision Pro by so many people on HN--it made actual Apple communities and publications look like jaded skeptics.
For clarity, I was only talking about the hardware side, not the software one. I don't think the models matter too much, by the time the hardware is ready there will be open models that Apple can take and modify to their liking.

Besides, did Anthropic and e.g. Mistral inherently have such troves of data to train on that Apple doesn't? For the last 6 months, Anthropic has had the SOTA model for the average production usecase.

> Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.

Intel had a much bigger advantage with x86, and look where we are now. I find it hard to believe that creating a good AI chip isn't a much smaller challenge than it was to do Apple Silicon. The upcoming SE uses their in-house 5G modem, another huge hardware achievement that no one else has been able to do.

With that in mind, how can you bet against Apple when it comes to designing chips at this point? It's not like Amazon et al aren't producing their own AI chips too. Let alone all of the startups like Cerebras. That indicates the moat and barriers are likely much lower than for Apple Silicon or the 5G modem.

If I'm talking nonsense, do correct me.

The Android on chip AI is and has been leagues better than what is available on iOS.

If anything, I think the upcoming iOS AI update will bring them to a similar level as android/google.

But given inference time compute, to give a strong reply reasonably fast, you'll need a lot of compute, very rarely used.

Economically this fits the cloud much better.

At what point does the on device stuff eat into their market share though? As on device gets better, who will pay for cloud compute? Other than enterprise use.

I’m not saying on device will ever truly compete at quality, but I believe it’ll be good enough that most people don’t care to pay for cloud services.

You're still focused on inference :)

inference basically does not matter, it is a commodity

You’re still focused on training :)

training doesn’t matter if inference costs are high and people don’t pay for them

but inference costs aren't high already and there are tons of hardware companies that can do relatively cheap LLM inference
That makes no sense. Inference costs dwarf training costs pretty quickly if you have a successful product. Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
> Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.

Stack enough GPUs and any of them can run o1. Building a chip to infer LLMs is much easier than building a training chip.

Just because one cost dwarfs another does not mean that this is where the most marginal value from developing a better chip will be, especially if other people are just doing it for you. Google gets a good model, inference providers will be begging to be able to run it on their platform, or to just sell google their chips - and as I said, inference chips are much easier.

I don’t think the AI market will ever really be a healthy one until inference vastly outnumbers training. What does it say about AI if training is done more than inference?

I agree that the in-device inference market is not important yet.

done more != where the value is at

inference hardware is a commodity in a way that training is not

The majority of people want better performance; running locally is just a nice-to-have feature.
They’ll care though when they have to pay for it, or when they’re in an area with poor reception.
They pay to run it locally as well (more expensive hardware)

And sure, poor reception will be an issue, but most people would still absolutely take a helpful remote assistant over a dumb local assistant.

And you don't exactly see people complaining that they can't run Google/YouTube/etc locally.

Your first sentence has the fallacy that you’re attributing the cost of the device to a single feature against the cost of that single feature.

Most people are unlikely to buy the device for the AI features alone. It’s a value add to the device they’d buy anyway.

So you need the paid for option to be significantly better than the free one that comes with the device.

Your second sentence assumes the local one is dumb. What happens when local ones get better? Again how much better is the cloud one to compete on cost?

To your last sentence, it assumes data fetching from the cloud. Which is valid but a lot of data is local too. Are people really going to pay for what Google search is giving them for free?

I think it's a more likely assumption that on device performance will trail off device models by a significant margin for at least the next few years - of course if magically you can make it work locally with the same level of performance it would be better.

Plus a lot of the "agentic" stuff is interaction with the outside world, connectivity is a must regardless.

Poor reception is rapidly becoming a non-issue for most of the developed world. I can’t think of the last time I had poor reception (in America) and wasn’t on an airplane.

As the global human population increasingly urbanizes, it’ll become increasingly easy to blanket it with cell towers. Poor(er) regions of the world will increase reception more slowly, but they’re also more likely to have devices that don’t support on-device models.

Also, Gemini Flash is basically positioned as a free model: (nearly) free API, free in GUI, free in Search results, free in a variety of Google products, etc. No one will be paying for it.

Many major cities have significant dead spots for coverage. It’s not just for developing areas.

Flash is free for API use at a low rate limit. Gemini as a whole is not free to Android users (free right now with subscription costs beyond a time period for advanced features) and isn’t free to Google without some monetary incentive. Hence why I also originally asked about private cloud compute alternatives with Google.

I ride a ferry from a city of 50k to a city of 700k in the US and work in a building with apartments upstairs, basically a concrete cave.

I see poor reception in both areas and only one has WiFi.

You can run a model >100x faster in the cloud compared to on device with DDR RAM. This would make up for the reception.
And you can’t run the cloud model at all if you can’t talk to the cloud.
Yes, but I can't imagine situations where I "have" to run a model when I don't have internet at that time. My life would be more affected by losing the rest of the internet than by being unable to run a small stupid model locally. At the very least until hallucination is completely solved, as I need internet to verify the models.
Latency is a huge factor in performance, and local models often have a huge edge. Especially on mobile devices that could be offline entirely.
Definitely not when it comes to LLMs: the larger, more useful local models are not that fast, and latency is not an issue. Just look at this Google model's voice function or even OpenAI's advanced voice.
If the model weights are not open, you can't run them on device anyway.
The Pixel 9 runs many small proprietary Gemini models on the internal TPU.
And yet these new models still haven’t reached feature parity with Google Assistant, which can turn my flashlight on, but with all the power of burning down a rainforest, Gemini still cannot interact with my actual phone.
I just tried asking my phone to turn on the flashlight using Gemini. It worked. https://9to5google.com/2024/11/07/gemini-utilities-extension...
Ok I tried literally last week on Pixel 7a and it didn’t work. What model do you have? Maybe it requires a phone that can do on-device models?
Gemini Nano weights have leaked and Google doesn't care about it. Google would definitely care if the Pro weights leaked.
Is there any phone in the world that can realistically run pro weights?
Yeah they've been slow to release end-user facing stuff but it's obvious that they're just grinding away internally.

They've ceded the fast mover advantage, but with a massive installed base of Android devices, a team of experts who basically created the entire field, a huge hardware presence (that THEY own), massive legal expertise, existing content deals, and a suite of vertically integrated services, I feel like the game is theirs to lose at this point.

The only caution is regulation / anti-trust action, but with a Trump administration that seems far less likely.