Hacker News new | ask | show | jobs
by kacperlukawski 1067 days ago
If you need semantic search locally then it's fine, but serving an embedding model might be still challenging. And if you want to expose it, your laptop might be not enough.
3 comments

I've hosted embedding models on AWS Lambda (fair that this is a vendor, but 1 vs. 3), if you try an LLM with 1B+ parameters you will struggle, but if the difference between a light-weight BERT-like transformer and an LLM is only a few % of loss, why bother getting your credit card out?

Edit: another thought, skip lambda entirely and run the embedding job on the server as a background process, and use an on-disk vector store (lancedb)

Shameless plug: I built Mighty Inference Server to solve this problem. Fast embeddings with minimal footprint. Better BEIR and MTEB scores using the lightning fast and small E5 V2 models. Scales linearly on CPU, no GPU needed.

https://max.io

The initial version of this actually used Mighty, but I didn't find any free tier available, so I switched to Cohere to keep the $0 pricetag.
Mighty is free if you're not making money from it. You could have used Mighty and I would have been glad to help you set it up :)
There's a bit of a difference between what you see following the 'purchase' link and what you see if you scroll down to 'pricing' on your site. It confused me at first too - I'm just so used to seeing a 'pricing' link in the top bar, I pretty much always go there first to see if there's a reasonable free tier for me to play with something.
Thanks for the feedback! I'll do my best to make things more clear.
You serve the embedding model in a lambda and then run something like FAISS in the backend.