| > It’s you who chooses what sites we crawl

Yeah, but you still reserve the right to not crawl sites (or to remove them from your index), yes? So there's still the opportunity to do evil.

I'm still waiting for a "raw" search spidering provider. One that:

1. runs a web-spidering cluster, one that's only smart enough to know what robots.txt is, to know how to follow links in HTML pages, and to obey response caching-policy headers;

2. captures the spidering process losslessly, as e.g. HAR transcript files;

3. packs those HAR transcript files, a few million at a time, into tar.xz.tar files (i.e. grab a "chunk" of N HAR files; group them into subdirs by request Host header; archive each subdir, and compress those archives independently; then archive all the compressed archives without compression), and then uploads these semi-random-access archives to a CDN or private BitTorrent tracker (or any other data delivery system that enables clients to only retrieve the blocks/byte-ranges of files they're interested in);

4. generates a TOC for the semi-random-access files, as a stream of tuples (signed archive URL, chunk byte-range, hostname, compressed URL-list), and pushes these to a managed reliable message queue on an IaaS, publishing each entry to both an all-hostnames topic and a per-hostname topic. (I say an IaaS, as this allows consumers to set up their own consumer-groups on these topics within their own IaaS project, and then pay the costs of message retention in these consumer-groups themselves.)

5. also buffers these TOC-entry streams into files (e.g. Parquet files), one archive series per topic, and hosts these alongside the HAR archives; then prunes TOC topic stream entries once they are at least N days old AND have been successfully "offlined" into a hosted TOC-stream archive.
---

This "web-spidering-firehose data-lake as-a-Service" architecture would enable pretty much anyone to build whatever arbitrary search index they want downstream of it, containing as much or as little of the web as they want. Each consumer only needs to do as much work as is required to fetch and parse the HARs of the domains they've decided to index.

This architecture would also be "temporal" (akin to a temporal RDBMS table): as a consumer of this service, you wouldn't see "the current version" of a scraped URL, but rather all previous attempts to scrape that URL, and what happened each time. (This would mean that no website could ever censor the dataset retroactively by adding a robots.txt "Disallow *" after scrapes have already happened. Their robots.txt config would prevent further scraping, but previous scrapes would be retained.)

And in fact, in this architecture, the HTTP interaction to retrieve /robots.txt for a domain would produce a HAR transcript that would get archived like any other. Domains restricted from crawling by robots.txt would still get regular HAR transcripts recorded of the result of checking that their /robots.txt still restricts crawling. (Reducing over these /robots.txt HAR transcripts is how a consumer-indexer would determine whether they should currently be showing or hiding a domain in their built index.) |
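To make that last "reduce over the /robots.txt transcripts" step a bit more concrete, here is a rough Python sketch of what a consumer-indexer might do. It is only an illustration under assumptions: `RobotsCapture` and `currently_crawlable` are hypothetical names, the capture record is a flattened stand-in for whatever a real HAR entry would contain, and the handling of non-200 responses is one possible policy rather than anything the proposal specifies. The only real library used is the standard `urllib.robotparser`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class RobotsCapture:
    """One archived fetch of /robots.txt (hypothetical, flattened from a HAR entry)."""
    host: str
    fetched_at: datetime
    status: int
    body: str


def currently_crawlable(
    captures: Iterable[RobotsCapture], host: str, user_agent: str = "*"
) -> Optional[bool]:
    """Reduce a host's capture history to its current show/hide state.

    Returns None if the host's /robots.txt was never captured.
    """
    history = [c for c in captures if c.host == host]
    if not history:
        return None
    # Older captures stay in the archive; only the newest one decides
    # what the index should show right now.
    latest = max(history, key=lambda c: c.fetched_at)
    if latest.status == 404:
        return True   # no robots.txt at all: nothing is disallowed
    if latest.status != 200:
        return False  # assumption: treat other failures conservatively
    parser = RobotFileParser()
    parser.parse(latest.body.splitlines())
    return parser.can_fetch(user_agent, f"https://{host}/")
```

A consumer would feed this the robots.txt captures it pulled from the per-hostname topic (or the offlined TOC archives) and hide the domain whenever the result is `False`. The earlier captures are never deleted, which is the "temporal" part.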
I'm not sure you would like the results of what you suggest - if you are really going to crawl everything indiscriminately, you will end up with a lot of rubbish. Just check out Common Crawl if you want to get an idea of what it would look like.