by boyter 4699 days ago
Personally, I consider extracting meaning from the crawl to be part of the indexing step. That just comes down to how you define it, though. In reality the line is blurry, since you need to do some pre-indexing during the crawl to extract meaningful data, and as you say, there are a lot of edge cases.

For basic crawling, though, it really is as simple as: while there are links left, download the next link.
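
To make that loop concrete, here is a rough sketch in Python using only the standard library. The names (`crawl`, `fetch`, `LinkExtractor`, `max_pages`) are made up for illustration, and it assumes a breadth-first queue plus a seen-set so the same URL is not downloaded twice. It deliberately skips the edge cases mentioned above (robots.txt, politeness delays, non-HTML content, encodings, URL normalisation).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (assumes UTF-8-ish HTML)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=100):
    """While there are links, download the next one. Returns {url: html}."""
    queue = deque([seed])   # links still to download
    seen = {seed}           # links already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it and move on
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the downloader in as `fetch_page` is just a convenience so the loop can be tried against canned pages without touching the network. Everything hard about a real crawler lives in what this sketch leaves out.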