by warangal 879 days ago
I think the image encoder from CLIP (even the smallest variant, ViT-B/32) is good enough to capture a lot of semantic information, which allows natural-language queries once the images are indexed. A lot of the work actually goes into integrating existing metadata, such as the local directory and date-time, to augment the NL query and re-rank the results.
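To make the indexing and re-ranking idea concrete, here is a minimal sketch, not the actual code of any tool mentioned in this thread. It assumes each image already has an embedding vector; the three-dimensional vectors, the `INDEX` layout, the `search` function, and the additive `boost` are all hypothetical stand-ins chosen for illustration. In practice the vectors would come from the CLIP image encoder and the query vector from the matching CLIP text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical index: toy vectors stand in for real image embeddings.
INDEX = [
    {"path": "trips/goa/beach.jpg", "year": 2021, "vec": [0.9, 0.1, 0.0]},
    {"path": "trips/goa/dinner.jpg", "year": 2021, "vec": [0.1, 0.9, 0.1]},
    {"path": "home/beach_poster.jpg", "year": 2019, "vec": [0.8, 0.2, 0.1]},
]

def search(query_vec, directory=None, year=None, boost=0.2, top_k=3):
    """Rank images by embedding similarity, then nudge scores using metadata."""
    results = []
    for item in INDEX:
        score = cosine(query_vec, item["vec"])
        # Assumed re-ranking rule: add a fixed bonus per matching metadata field.
        if directory and item["path"].startswith(directory):
            score += boost
        if year is not None and item["year"] == year:
            score += boost
        results.append((score, item["path"]))
    results.sort(reverse=True)
    return [path for _, path in results[:top_k]]

query = [0.8, 0.2, 0.1]  # stand-in for an encoded text query
print(search(query))                       # similarity only
print(search(query, directory="trips/"))   # similarity plus directory hint
```

With the toy data, the directory hint moves the photo under `trips/` ahead of the one that matches slightly better on similarity alone. A fixed additive bonus is only one possible re-ranking rule; filtering by metadata first and ranking the survivors is an equally reasonable choice.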

I work on such a tool[0] to enable end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi

2 comments

I would love to see some benchmarks on that.
I keep forgetting to publish a benchmark on a standard dataset like Flickr30k! But a ballpark figure would be about 100 ms per image on a quad-core CPU. I also generate an ETA during indexing and provide some meta-information to make it easy to see what data is being indexed.
ViT-H and g are fine; I wouldn't use B anymore.
It is quite possible that the B variant is not enough for some scenarios. An earlier version also included video search; the frames used for indexing were sometimes blurry (lacking fine detail), and these frames would generally get higher scores for naive natural-language queries. I only tested with the B variant.

But I resolved that problem up to a point by adding a linear layer trained to discard such frames, and it was less costly than running a bigger variant for my use case.
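For readers wondering what such a filter might look like, here is a rough sketch under stated assumptions; it is not the code described above. It assumes the linear layer takes a frame embedding as input and is trained as a logistic-regression classifier on frames labelled blurry (1) or sharp (0). The two-dimensional vectors, the names `train_linear_filter` and `is_blurry`, and the hyperparameters are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_filter(embeddings, labels, lr=0.5, epochs=500):
    """Fit a single linear layer (weights + bias) with plain SGD on log loss."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def is_blurry(x, w, b, threshold=0.5):
    """Return True when the frame should be dropped before indexing."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold

# Toy stand-ins for frame embeddings: label 1 = blurry, 0 = sharp.
TRAIN_X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
           [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]
TRAIN_Y = [0, 0, 0, 1, 1, 1]

W, B = train_linear_filter(TRAIN_X, TRAIN_Y)
frames = [[0.85, 0.15], [0.15, 0.85]]
kept = [f for f in frames if not is_blurry(f, W, B)]
print(kept)
```

The appeal of this design is cost: one dot product per frame on top of an embedding that is already computed, compared with running a much larger encoder on every frame.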

Can you give details as to why not?