by mstolpm 3570 days ago
Besides porn not being filtered out and the result ordering not prioritizing "quality" sources, some of the indexed site data is at least 4-6 months old, and the sites have changed heavily since the last crawl. I even got 404 errors. That makes it very hard to find a use for the project beyond academic interest.

A fresh recrawl is currently running. Should take about 2-3 months. Newly crawled data will gradually replace older data during that time.
Great work, congrats. :-)

Here is some input based on my experience building a similar project at my former company. (We did not quite get to 2B pages; we reached roughly 300M.)

To build a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Obviously, frequently re-crawling a massive index consumes huge amounts of bandwidth and CPU cycles. Here is how we optimized resource utilization:

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.

Corresponding to each indexed URL, also store a sort-of 'crawl-history' (if space is a constraint, don't store each version of the URL's content, store only the latest one). On each re-crawl, store two data fields: a time-stamp and a boolean indicating whether the URL content has changed since the last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then, prioritize the re-crawls based on the update frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently).
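
To make the scheme above concrete, here is a minimal Python sketch. It is not the code we actually ran; the names (CrawlRecord, update_frequency, next_crawl_due, prioritize), the "fraction of re-crawls that saw a change" score, and the 1-day/90-day interval bounds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

DAY = 86400.0  # seconds


@dataclass
class CrawlRecord:
    """Per-URL bookkeeping (hypothetical structure, for illustration only)."""
    url: str
    last_crawled: float = 0.0  # Unix timestamp of the most recent crawl
    # One (timestamp, changed) pair per re-crawl; old page bodies are not kept.
    history: List[Tuple[float, bool]] = field(default_factory=list)

    def record_crawl(self, timestamp: float, changed: bool) -> None:
        """Append one history entry and refresh the 'Last Crawled' time-stamp."""
        self.history.append((timestamp, changed))
        self.last_crawled = timestamp

    def update_frequency(self) -> float:
        """Fraction of re-crawls that observed a change, in [0, 1]."""
        if not self.history:
            return 1.0  # assumption: URLs with no history are treated as fast-changing
        changes = sum(1 for _, changed in self.history if changed)
        return changes / len(self.history)


def next_crawl_due(record: CrawlRecord,
                   min_interval: float = 1 * DAY,
                   max_interval: float = 90 * DAY) -> float:
    """Map the score linearly onto a re-crawl interval.

    A score of 1.0 gives min_interval, a score of 0.0 gives max_interval.
    Both bounds are assumed values, not figures from the thread.
    """
    score = record.update_frequency()
    interval = max_interval - score * (max_interval - min_interval)
    return record.last_crawled + interval


def prioritize(records: List[CrawlRecord]) -> List[CrawlRecord]:
    """Order URLs so the one whose re-crawl is due soonest comes first."""
    return sorted(records, key=next_crawl_due)
```

A URL that changed on every visit ends up at the front of the queue and gets re-crawled roughly daily, while one that never changed drifts towards the 90-day bound. A production crawler would probably want something smarter than a plain ratio (e.g. weighting recent observations more heavily), but the bookkeeping stays the same.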

If you need any more help/input, let me know and I'll be happy to do what I can.

HTH and all the best moving forward.

We had also (obviously) built a (proprietary) ranking algo that took into account some 60+ individual factors. If it can be of any help, I'll create a list and send it to you.
Why not write that list here?
Good idea. However, I'll need to really exercise the gray cells to put together the list so it might take me a couple of days. Once done, I'll post it here.