by a3camero 2084 days ago
I’ve done this over the last year on a tiny scale for my own needs: gorillafind.com. It’s built from scratch and covers only government sites, which sidesteps some of the challenges (so far it indexes only 50 sites). The cost per site is around $1/mo for crawling, indexing, converting file formats and then serving up results. It’s difficult but not impossible, and very educational. If you’d like to hear more about doing it yourself and some of the challenges, feel free to email me using the contact info on the site. My system isn’t open source, but I’m more than happy to chat about the research I’ve done and how you can make one.

I’d start by not aiming for state of the art, because it’s overkill for an “MVP”. And if you don’t need proper browser rendering of pages, there are open source crawlers out there like Nutch that might work. If you’re making one yourself, the older academic papers and presentations by search companies are a good resource, as the basic ideas of crawling and indexing haven’t changed too much (even if ranking and other components have changed a lot). A search engine is really a set of related components, and there are many examples out there to use as inspiration for your MVP.
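To make the "set of related components" idea concrete, here is a minimal sketch of three of them (crawl, index, search) using only the Python standard library. It is not how gorillafind.com works, which isn't described here; every name in it (`PageParser`, `crawl`, `build_index`, `search`, the in-memory `SITE`) is hypothetical, and `fetch` stands in for a real HTTP client. It assumes plain HTML with no browser rendering, and ranks by raw term counts.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects text and outgoing links from one HTML page (no JS rendering)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Naive: also picks up <script>/<style> contents if present.
        self.text_parts.append(data)


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(fetch, seed, limit=100):
    """Breadth-first crawl. `fetch` is an assumed callable: url -> HTML or None."""
    queue, seen, pages = [seed], {seed}, {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Inverted index: term -> Counter of {url: occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] += 1
    return index


def search(index, query):
    """Return urls containing every query term, highest total term count first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, Counter()) for term in terms]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    scored = {url: sum(p[url] for p in postings) for url in matches}
    return sorted(scored, key=lambda url: (-scored[url], url))


# A made-up three-page site held in memory, so the sketch runs offline.
SITE = {
    "/": 'Welcome to the city site. <a href="/permits">Permits</a> <a href="/taxes">Taxes</a>',
    "/permits": 'Building permits: apply for permits online. <a href="/">Home</a>',
    "/taxes": "Property taxes and permits for tax purposes.",
}

pages = crawl(SITE.get, "/")
index = build_index(pages)
print(search(index, "permits"))
print(search(index, "property taxes"))
```

A real crawler would also need things this sketch leaves out: robots.txt handling, rate limiting, URL normalization, and the file format conversion mentioned above.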

1 comment

> gorillafind.com

Interesting. (゚ヮ゚) You seem well aware that search goes beyond ranking and retrieval. [1]

> I’d start off with not doing state of the art because it’s overkill for an “MVP”.

Can a search engine "MVP" sustain itself without using SOTA methods/practices?

What initial scale does a search MVP need before it just works, by modern standards?

[1] : https://medium.com/startup-grind/what-every-software-enginee...

Just a note (you can ignore it): I am actually a believer in iterative development and incremental optimisation. But even a minimum viable product should be able to work.