by hashemian 417 days ago
To those who argue that LLMs might cheat by using EXIF: I saw a post recently on Twitter (https://x.com/tszzl/status/1915212958755676350) and, out of curiosity, screen-captured the photo and passed it to o3. So no EXIF.

You can read the chat here: https://chatgpt.com/share/680a449f-d8dc-8001-88f4-60023323c7...

It took 4.5 minutes to guess the location. The guess was accurate (checked using Google Street View).

What was amazing about it:

    1. The photo did not have ANY text

    2. It picked elements of the image and inferred the location from those, like a fountain in a courtyard, or the shape of the buildings.

All in all, it's just mind-blowing how this works!
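
On the EXIF point above: anyone who wants to confirm that a file really carries no EXIF before uploading it can check directly. The following is a minimal sketch, assuming the image is a JPEG (screenshots are often PNG, which this does not cover). The function name `jpeg_has_exif` is made up for illustration. It only walks the header segments looking for an APP1 segment that starts with the `Exif` identifier; it does not parse the metadata itself.

```python
import struct

def jpeg_has_exif(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 'Exif' segment.

    Hypothetical helper for illustration; it scans header segments only
    and stops at the start of the compressed image data.
    """
    if data[:2] != b"\xff\xd8":              # every JPEG starts with SOI
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:                # lost marker sync, give up
            break
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):           # start of scan / end of image
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

Usage would be along the lines of `jpeg_has_exif(open("photo.jpg", "rb").read())`, where `photo.jpg` is a placeholder path.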
3 comments

See my other comment: https://news.ycombinator.com/item?id=43804041

4o can do it almost as well in a few seconds and probably with 10-50x fewer tokens: https://chatgpt.com/share/680ceeff-011c-8002-ab31-d6b4cb622e...

o3 burns through what I assume is single-digit dollars just to do some performative tool use to justify and slightly narrow down its initial intuition from the base model.

It absolutely tried to use EXIF data when I asked it to guess the location. Here is proof - https://imgur.com/a/CHde2Cx

I couldn't attach the chat directly since it's a temporary chat.

I don't see how this is mind blowing, or even mildly surprising! It's essentially going to use the set of features detected in the photo as a filter to find matching photos in the training set, and report the most frequent matches. Sometimes it'll get it right, sometimes not.

It'd be interesting to see the photo in the linked story at the same resolution as provided to o3, since the licence plate in the photo in the story is at much lower resolution than the zoomed-in version shown that o3 had access to. It's not a great piece of primary evidence to focus on though, since a CA plate doesn't have to mean the car is in CA.

The clues that o3 doesn't seem to be paying attention to seem just as notable as the ones it does. Why is it not talking about car models, felt roof tiles, sash windows, mini blinds, the fire pit (with a warning on the glass, in English), etc.?

Being location-doxxed by a computer trained on a massive set of photos is unsurprising, but the example given doesn't seem a great example of why this could/will be a game changer in terms of privacy. There's not much detective work going on here - just narrowing the possibilities based on some of the available information, and happening to get it right in this case.

If you want to be impressed I suggest trying this yourself on your own photos.

I don't consider it my job to impress or mind-blow people: I try to present as realistic as possible a representation of what this stuff can do.

That's why I picked an example where its first guess was 200 miles off!

Reading the replies to this is funny. It's like the classic Dropbox thread. "But this could be done with a nearest neighbor search and feature detection!" If this isn't mind-blowing to someone, I don't know if any amount of explaining will help them get it.
It's not mind-blowing because there were public systems performing much better years earlier, using the exact same tech. This is less like rsync vs Dropbox and more like freaking out over Origin or Uplay when Steam has been around for years.
Which public systems were those?
I'm not a computer. I expect a computer to also do better than me at memorizing the phone book, but I'm not impressed by it.
In that case, are you at all surprised that this technology did not exist two years ago?
I'm not sure what you're getting at. What's useful about LLMs, and especially multi-modal ones, is that you can ask them anything and they'll answer to the best of their ability (especially if well prompted). I'm not sure that o3, as a "reasoning" model, is adding much value here, since there is not a whole lot of reasoning going on.

This is basically fine-grained image captioning followed by nearest neighbor search, which is certainly something you could have built as soon as decent NN-based image captioning became available, at least 10 years ago. Did anyone do it? I've no idea, although it'd seem surprising if not.
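
To make the "captioning followed by nearest neighbor search" idea concrete, here is a minimal sketch. It is not a claim about how o3 or any real system works. Everything in it is an assumption for illustration: the feature vectors stand in for whatever a captioning or feature-detection step would produce, the place names are invented, and `guess_location` is a made-up function that returns the most frequent label among the `k` most similar reference photos.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def guess_location(query, index, k=3):
    """Vote among the k reference photos most similar to the query.

    index is a list of (feature_vector, location_label) pairs.
    """
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: [fountain, courtyard, palm tree, sash windows]
REFERENCE_PHOTOS = [
    ([1, 1, 0, 0], "Town A"),
    ([1, 1, 1, 0], "Town A"),
    ([0, 0, 1, 1], "Town B"),
    ([0, 1, 0, 1], "Town B"),
]

print(guess_location([1, 1, 0, 0], REFERENCE_PHOTOS))
```

With toy data like this the answer depends entirely on which features were detected and which reference photos exist, which is the "sometimes it'll get it right, sometimes not" behaviour described earlier in the thread.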

As noted, what's useful about LLMs is that they are a "generic solution", so one doesn't need to create a custom ML-based app to be able to do things like this, but I don't find much of a surprise factor in them doing well at geoguessing since this type of "fuzzy lookup" is exactly what a predict-next-token engine is designed to do.

How does nearest neighbor search relate to this?
So you admit that this tech is at least 2 years old publicly and likely much older privately?
Did it not exist, or was no one interested enough to build one? I’m pretty certain there’s a database of portraits somewhere that lets you look up ID details from a photograph. Automatic tagging exists in photo software. I don’t see why that can't be extrapolated to landmarks with enough data.
I think you are underestimating the importance of a "world model" in the process. It is the modeling of how all these details are related to each other that is critical here.

The LLM will have an edge by being able to draw on higher level abstract concepts.

If it existed two years ago I certainly couldn't play with it on my phone.