| I'm a minimally technical entrepreneur who could use some guidance on piecing together an MVP for a search product. Some quick background on the idea (also in the linked document):

What Users Want: Ability to quickly generate a list of company websites that match their custom industry definition/query.

MVP Goal: Complex enough to properly validate/reject the idea, simple enough not to be prohibitively expensive or take too many man-hours to reach the validation/rejection step. Ideally, I can add features/complexity to the implementation post validation without completely changing the architecture (for example, validating initially with something like BM25 and adding vector search afterwards).

Example user input queries:
software for catering businesses
crane inspection service
laboratory reagent suppliers

Output for each case would be a list of relevant businesses that users can further filter by criteria like employee count, location, etc.

Some key questions I could use some help answering are:
a) At present, how much value will vector search add beyond BM25/BM25F or similar?
b) Given the recent rate of progress in LLMs, I'm expecting embeddings for search to improve at a similar rate, so I'm assuming I will be implementing vector-based search in the near future even if it's not part of the MVP.

I've shared some of my research so far in the linked document. Would really appreciate some feedback on it. How would you build this MVP if you were trying to do it bootstrapped/solo? |
Certainly larger models will come, and people might find ways to make LLMs more scalable, but for now you are going to be crunching your documents down to a size the model can take as input.
It is a path less taken in the industry, but there is a methodology for evaluating search engines; see
https://github.com/usnistgov/trec_eval
You can certainly try BM25 and decide off the cuff whether you like it or not, but if you want to compare a lot of different approaches you're going to need a set of documents, a set of queries, and relevance judgments for the responses ("is this relevant?").
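To make that concrete, here is a rough sketch of how judged responses turn into numbers you can compare across engines. It is not trec_eval itself (that tool reads judgments and results from files); the `qrels` and `run` dicts, the query ID, and the `.example` domains are all made up for illustration, and unjudged documents are counted as not relevant.

```python
import math

# Hypothetical relevance judgments ("qrels"): query id -> {doc id: grade}.
# Grade 0 means "not relevant"; anything above 0 means "relevant".
qrels = {
    "q1": {"catersoft.example": 1, "cranes.example": 0, "mealplan.example": 1},
}

# Hypothetical ranked results ("run") from the engine being tested.
run = {
    "q1": ["catersoft.example", "cranes.example",
           "mealplan.example", "unjudged.example"],
}

def precision_at_k(ranked, judged, k):
    """Fraction of the top-k results that were judged relevant."""
    return sum(1 for d in ranked[:k] if judged.get(d, 0) > 0) / k

def average_precision(ranked, judged):
    """Mean of precision values taken at each relevant result's rank."""
    n_relevant = sum(1 for g in judged.values() if g > 0)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if judged.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / n_relevant

def ndcg_at_k(ranked, judged, k):
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(judged.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(judged.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

for qid, ranked in run.items():
    judged = qrels[qid]
    print(qid,
          round(precision_at_k(ranked, judged, 2), 3),
          round(average_precision(ranked, judged), 3),
          round(ndcg_at_k(ranked, judged, 3), 3))
```

Run the same queries through BM25, a vector index, or whatever else you try, and compare the averages over all queries rather than eyeballing individual result lists.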
I'd imagine you could train a retrieval model on that kind of data, much like they train ChatGPT. It's probably not as hard, but it would still be a substantial project that needs a lot of training data. I bet you could beat cosine similarity on the vectors, though.
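For reference, the cosine-similarity baseline that a trained model would have to beat is only a few lines. This is a toy sketch: the 3-dimensional vectors and `.example` doc IDs are invented, and in practice the vectors would come from whatever embedding model you pick.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up embeddings standing in for real model output.
doc_vecs = {
    "a.example": [1.0, 0.0, 0.0],
    "b.example": [0.7, 0.7, 0.0],
    "c.example": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]

ranking = rank_by_cosine(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Feeding a ranking like this into the same relevance judgments gives you the number a learned retrieval model has to improve on.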