by bigattichouse
814 days ago
I've been fascinated by a brief mention of 0.68 quantization in the 1.58bit quantization article, which I believe means using 0,1 instead of 1.58's -1,0,1. When I read https://www.reddit.com/r/LocalLLaMA/comments/1bpa6ol/unoffic... about a great experiment in building an unofficial 1.58b quantization, I began to wonder if I could squeeze a vector down to 1 bit per dimension. And... I can! (With some caveats, covered in the discussion.)

It clicked when I realized that XNOR and population count can score 32 dimensions at a time. While this isn't ANYTHING like an actual quantized LLM, I thought it was a really nice proof of concept, and it could be very useful for smaller machines running RAG applications.

My Code: https://github.com/bigattichouse/bitvector_research

My Write Up: https://bigattichouse.medium.com/dreamcoat-cosine-similarity...

NOTE: I'm not claiming this is 30X faster than GPUs, only that CPU implementations could be 30X faster. Good enough for little machines like Arduino at least.
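To make the XNOR-and-popcount idea concrete, here is a minimal sketch in Python. It is not taken from the linked repository: the names `pack_bits`, `popcount`, and `match_score` are hypothetical, and binarizing each dimension by thresholding at 0 is an assumption made for illustration. The score is simply the number of dimensions where two bit vectors agree, computed 32 dimensions per word.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1  # keep results within one 32-bit word


def pack_bits(vector, threshold=0.0):
    """Pack floats into 32-bit words; a bit is 1 where value > threshold.

    Thresholding at 0 is an assumption, not necessarily the author's method.
    """
    words = []
    for start in range(0, len(vector), WORD_BITS):
        word = 0
        for offset, value in enumerate(vector[start:start + WORD_BITS]):
            if value > threshold:
                word |= 1 << offset
        words.append(word)
    return words


def popcount(word):
    """Count the set bits in a non-negative integer."""
    return bin(word).count("1")


def match_score(a_words, b_words, dims):
    """Return how many of the `dims` dimensions have equal bits."""
    matches = 0
    for a, b in zip(a_words, b_words):
        # XNOR: a bit is 1 wherever the two inputs agree.
        matches += popcount(~(a ^ b) & MASK)
    # Unused padding bits are 0 in both vectors and count as agreement,
    # so subtract them to score only the real dimensions.
    padding = len(a_words) * WORD_BITS - dims
    return matches - padding


if __name__ == "__main__":
    a = [0.5, -0.2, 0.1, -0.9]
    b = [0.3, 0.4, 0.2, -0.1]
    print(match_score(pack_bits(a), pack_bits(b), len(a)))
```

In C on a real CPU, the inner loop would typically be a single XOR, a NOT, and a hardware popcount instruction per word, which is where the speedup over float arithmetic would come from.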