by cricketlover 753 days ago
I went through the post and I have absolutely no clue what this person is talking about. But I want to be in a place where I can understand what the person is saying.

How can I reach that point? I was lost at quantized, could understand bit packing, and was even more lost when the author started talking about things like Hamming Distance.

Please help me out. I want to grow my career in this direction.

2 comments

First you need to understand embeddings, and CLIP. I have a detailed guide here that should help you with that: https://simonwillison.net/2023/Oct/23/embeddings/

Then you need to understand binarization. This is a surprisingly effective trick. It relies on the observation that if you have an embedding vector of, say, 1000 numbers, then for many models those numbers will be very small floating point values that sit just above or below zero.

It turns out you can turn those thousand floating point numbers into one thousand single bits where each bit simply represents if the value is above or below zero... and the embedding magic mostly still works!
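To make the idea concrete, here is a minimal sketch in plain Python. The eight-number `embedding` is made up for illustration (real models produce hundreds or thousands of values), and the helper names `binarize` and `pack_bits` are hypothetical, not taken from any library.

```python
def binarize(vector):
    """Map each float to 1 if it is above zero, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int, first bit most significant."""
    packed = 0
    for bit in bits:
        packed = (packed << 1) | bit
    return packed


# Toy embedding (an assumption for illustration, not real model output).
embedding = [0.02, -0.01, 0.03, -0.04, 0.001, -0.2, 0.5, -0.003]

bits = binarize(embedding)    # [1, 0, 1, 0, 1, 0, 1, 0]
packed = pack_bits(bits)      # 0b10101010 == 170
```

The packing step is the "bit packing" the original post mentions: eight values that each took a full float now fit in a single byte.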

And instead of the usual cosine distance, you can compare two vectors with a much faster Hamming distance function, which simply counts the bit positions where the two vectors differ.
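A rough sketch of why this is fast: once two vectors are packed into integers, the comparison is a single XOR followed by a count of the set bits. The function name `hamming_distance` and the two sample values are assumptions for illustration.

```python
def hamming_distance(a, b):
    """Count the bit positions where two packed bit vectors differ."""
    # XOR leaves a 1 exactly where the two inputs disagree.
    return bin(a ^ b).count("1")


a = 0b10101010
b = 0b10011010

distance = hamming_distance(a, b)  # differs in two positions, so 2
```

On Python 3.10 or later, `(a ^ b).bit_count()` does the same count without building a string.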

Once you understand embedding vectors and CLIP that should hopefully make sense.

The part of CLIP[1] that you need to know to understand this is that it embeds text and images into the same space, i.e. the word "dog" is close to images of dogs. Normally this space is a high dimensional real space. Think 512 dimensions, or 512 floating point numbers per embedding. When you want to measure "closeness" between vectors in this space, cosine similarity[2] is a natural choice.
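As a small illustration of "closeness", here is cosine similarity in plain Python. The three-dimensional vectors are invented stand-ins for a text embedding and two image embeddings; they are not real CLIP output, which would have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand for illustration.
dog_text = [0.9, 0.1, 0.0]
dog_image = [0.8, 0.2, 0.1]
car_image = [0.0, 0.1, 0.9]

text_vs_dog = cosine_similarity(dog_text, dog_image)
text_vs_car = cosine_similarity(dog_text, car_image)
# text_vs_dog is much larger than text_vs_car, so the dog image is the closer match.
```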

Why would you want to quantize values? Well, instead of using a 32-bit float for each dimension, what if you could get away with 1 bit? You would need 32x less space. Often you'll want to embed millions or billions of pieces of text or images, so that means huge speed and cost savings, and if accuracy isn't impacted too much then it could be worth it.
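The arithmetic is easy to check. This sketch assumes 512-dimensional embeddings (the size mentioned above) and a hypothetical collection of one billion vectors.

```python
DIMENSIONS = 512
NUM_VECTORS = 1_000_000_000  # hypothetical collection size

float_bytes = DIMENSIONS * 4     # 32-bit float = 4 bytes per dimension
binary_bytes = DIMENSIONS // 8   # 1 bit per dimension, 8 bits per byte

ratio = float_bytes / binary_bytes

total_float_gb = float_bytes * NUM_VECTORS / 1e9
total_binary_gb = binary_bytes * NUM_VECTORS / 1e9

print(f"per vector: {float_bytes} bytes vs {binary_bytes} bytes ({ratio:.0f}x smaller)")
print(f"one billion vectors: {total_float_gb:.0f} GB vs {total_binary_gb:.0f} GB")
```

At that scale the binary version fits in the memory of a single machine, while the float version generally would not.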

If you naively quantize the floats of an existing model down to single bits, it severely impacts accuracy. However, if you train a model from scratch that produces binary outputs, then it appears to perform better.

There is one twist. Deep learning models rely on gradient descent to train, and binary output doesn't produce useful gradients. We use cosine similarity on floating point vectors and Hamming distance on bit vectors. Is there a function that behaves like Hamming distance but is nicely differentiable? If so, we can use that function during training and then vanilla Hamming distance during inference. It seems like they've done that.
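I don't know which function the post's author actually used, so treat the following as one possible sketch rather than their method. It relies on a standard identity: for two vectors of +1/-1 values of length n, the Hamming distance equals (n - dot product) / 2. Swapping the hard sign for `tanh` gives a smooth version. The names `hard_hamming`, `soft_hamming`, and `scale` are my own.

```python
import math


def hard_hamming(x, y):
    """Exact Hamming distance between the sign patterns of two float vectors."""
    return sum(1 for a, b in zip(x, y) if (a > 0) != (b > 0))


def soft_hamming(x, y, scale=1.0):
    """Smooth stand-in: replace sign(v) with tanh(scale * v), then use (n - dot) / 2."""
    sx = [math.tanh(scale * a) for a in x]
    sy = [math.tanh(scale * b) for b in y]
    dot = sum(a * b for a, b in zip(sx, sy))
    return (len(x) - dot) / 2


# Made-up vectors whose signs disagree in two of four positions.
x = [0.8, -0.5, 0.3, -0.9]
y = [0.7, 0.4, -0.2, -0.6]

exact = hard_hamming(x, y)                  # 2
smooth = soft_hamming(x, y)                 # a float strictly between 0 and 4
sharp = soft_hamming(x, y, scale=50.0)      # very close to the exact value
```

Because `tanh` is differentiable everywhere, a training framework could backpropagate through `soft_hamming`, and raising `scale` pushes it toward the hard version used at inference time.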

I'd suggest playing around with OpenCLIP[3]. My background is in data science but all my CLIP knowledge comes from doing a side project over the course of a couple weekends.

1. https://huggingface.co/docs/transformers/model_doc/clip

2. https://en.wikipedia.org/wiki/Cosine_similarity

3. https://github.com/mlfoundations/open_clip