Hey HN! We're Alex and Szymon from Bluesight (https://bluesight.ai/), where we're developing a foundation model for satellite data. We've created a demo to showcase the current capabilities of state-of-the-art models and identify areas for improvement.

Our demo allows you to search for objects in San Francisco using natural language. You can look for things like Tesla cars, dry patches, boats, and more.

Key features:
- Search using text or by selecting an object from the image as a source ("aim" icon)
- Toggle between object search (default) and tile search ("big" toggle, useful when contextual information matters, like tennis courts)
- Adjust results with downvotes (useful when results are water images)
- Click on tiles to locate them on a map
- Control the number of retrieved tiles with a slider

We use OpenAI's CLIP model (https://openai.com/index/clip/) to put text and images into the same embedding space. We do a similarity search within this space using the text query or the source image. We are using CLIP finetuned on pairs of satellite images and OpenStreetMap (https://www.openstreetmap.org/) tags (https://github.com/wangzhecheng/SkyScript) because vanilla CLIP performs poorly on satellite data. We pre-segment objects using Meta's Segment Anything Model (https://segment-anything.com/) and pre-compute CLIP embeddings for each object.

We'd love to hear your thoughts! What worked well for you? Where did it fail? What features do you wish it had? Any real-world problems you think this could help with?
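For readers curious about the retrieval step described above (a shared embedding space, a similarity search, and downvotes), here is a minimal sketch in plain Python. It is not Bluesight's implementation: the tiny 2-D vectors stand in for pre-computed CLIP embeddings, and the downvote handling (subtracting a penalty proportional to similarity with downvoted objects) is an assumption made for illustration, as are the names `search` and `penalty`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, object_vecs, downvoted=(), k=3, penalty=0.5):
    """Return indices of the k objects most similar to the query.

    Downvoted objects are excluded, and every remaining object is penalized
    by its similarity to the closest downvoted object (hypothetical scheme).
    """
    scored = []
    for idx, vec in enumerate(object_vecs):
        if idx in downvoted:
            continue
        score = cosine(query_vec, vec)
        if downvoted:
            score -= penalty * max(cosine(object_vecs[d], vec) for d in downvoted)
        scored.append((score, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings with made-up axes: (boat-like, water-like).
objects = [
    [0.2, 1.0],  # 0: open water
    [0.3, 1.0],  # 1: more open water
    [1.0, 0.1],  # 2: boat
    [0.9, 0.5],  # 3: boat at a dock
]
query = [1.0, 0.3]  # stand-in for the embedding of the text "boat"

top = search(query, objects, k=1)
after_downvote = search(query, objects, downvoted={0}, k=3)
```

With these toy vectors, downvoting the water object 0 removes it from the results and pushes the other water object to the bottom of the ranking. In the real demo the vectors would come from the finetuned CLIP model, and a vector index would likely replace the linear scan.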
Some notes from experimenting:
High level, I would expect that using an LLM to caption the images post-SAM would do better than CLIP by itself.
I wish that it put my search query and mode as a URL param so sharing would be a bit easier.
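To make that concrete, here's a rough sketch of round-tripping the query and mode through URL params. The base URL and the param names `q` and `mode` are made up for illustration, and I'm assuming "object" as the fallback mode only because the post says object search is the default.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_share_url(base_url, query, mode):
    """Encode the search query and mode as URL query parameters."""
    return f"{base_url}?{urlencode({'q': query, 'mode': mode})}"

def read_share_url(url):
    """Recover (query, mode) from a shared URL, falling back to defaults."""
    params = parse_qs(urlsplit(url).query)
    query = params.get("q", [""])[0]
    mode = params.get("mode", ["object"])[0]  # assumed default mode
    return query, mode

# Hypothetical base URL, not the real demo address.
shared = build_share_url("https://example.com/demo", "skate park", "tile")
restored = read_share_url(shared)
```

The same idea carries over to the browser side with `URLSearchParams`.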
Main use-case I could come up with would be spotting areas lacking accessibility features or maybe homeless encampments? Idk, I definitely couldn't think of any commercialization ideas off the jump.
Another thought I had was why you guys chose aerial satellite instead of street view data? I imagine something like "palace of fine arts" would have worked with that approach.
- "skate park" has some classic dense vector failures where it will rank an outdoor playground-style area as more similar than an actual skate park. Also, big mode is really cool and works much better for these kinds of queries. "chess board" will rank checker patterns over actual chess boards. A bunch of examples follow this pattern. I wish there was an additional search mode for LLM-generated descriptions of the segmented images, but there are probably cost constraints there.
- "USF" also didn't work well. I guess not surprising given it's CLIP, but still kind of interesting. I wonder what it would take to make the multi-modal models better at OCR without actually doing OCR.
- "beach" didn't work great, which surprised me
- "picnic tables" and "lots of people" also didn't work, no idea why