by derefr 710 days ago
> It’s you who chooses what sites we crawl

Yeah, but you still reserve the right to not crawl sites (or to remove them from your index), yes? So there's still the opportunity to do evil.

I'm still waiting for a "raw" search spidering provider. One that:

1. runs a web-spidering cluster — one that's only smart enough to know what robots.txt is, to know how to follow links in HTML pages, and to obey response caching-policy headers;

2. captures the spidering process losslessly, as e.g. HAR transcript files;

3. packs those HAR transcript files, a few million at a time, into tar.xz.tar files (i.e. grab a "chunk" of N HAR files; group them into subdirs by request Host header; archive each subdir, and compress those archives independently; then archive all the compressed archives without compression) — and then uploads these semi-random-access archives to a CDN or private BitTorrent tracker (or any other data delivery system that enables clients to only retrieve the blocks/byte-ranges of files they're interested in);

4. generates a TOC for the semi-random-access files, as a stream of tuples (signed archive URL, chunk byte-range, hostname, compressed URL-list), and pushes these to a managed reliable message queue on an IaaS, publishing each entry to both an all-hostnames topic and a per-hostname topic. (I say an IaaS because this allows consumers to set up their own consumer-groups on these topics within their own IaaS project, and then pay the costs of message retention in these consumer-groups themselves.)

5. also buffers these TOC-entry streams into files (e.g. Parquet files), one archive series per topic, and hosts these alongside the HAR archives. TOC topic stream entries get pruned only if they are at least N days old AND have been successfully "offlined" into a hosted TOC-stream archive.
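To make steps 3 and 4 a bit more concrete, here's a rough sketch of the tar.xz.tar packing and the byte-range TOC, using only the Python standard library. Everything named here is my own assumption for illustration: `pack_chunk` and `read_host` are hypothetical helpers, HARs are passed in as plain dicts, grouping uses the URL's hostname as a stand-in for the request Host header, and the TOC tuples leave out the signed archive URL (that depends on wherever the archive ends up hosted).

```python
import io
import json
import tarfile
from collections import defaultdict
from urllib.parse import urlsplit


def pack_chunk(hars):
    """Pack (url, har_dict) pairs into an uncompressed tar of per-host .tar.xz members.

    Returns (outer_bytes, toc), where toc is a list of
    (hostname, (start, end), urls) tuples and `end` is exclusive.
    """
    by_host = defaultdict(list)
    for url, har in hars:
        by_host[urlsplit(url).hostname].append((url, har))

    outer_buf = io.BytesIO()
    with tarfile.open(fileobj=outer_buf, mode="w") as outer:  # outer layer: no compression
        for host in sorted(by_host):
            inner_buf = io.BytesIO()
            # Each host's HARs are archived and xz-compressed independently.
            with tarfile.open(fileobj=inner_buf, mode="w:xz") as inner:
                for i, (_url, har) in enumerate(by_host[host]):
                    payload = json.dumps(har).encode("utf-8")
                    info = tarfile.TarInfo(name=f"{host}/{i:08d}.har")
                    info.size = len(payload)
                    inner.addfile(info, io.BytesIO(payload))
            blob = inner_buf.getvalue()
            info = tarfile.TarInfo(name=f"{host}.tar.xz")
            info.size = len(blob)
            outer.addfile(info, io.BytesIO(blob))
    outer_bytes = outer_buf.getvalue()

    # Re-read the outer tar to learn where each member's data actually landed.
    toc = []
    with tarfile.open(fileobj=io.BytesIO(outer_bytes), mode="r:") as outer:
        for member in outer:
            host = member.name[: -len(".tar.xz")]
            start = member.offset_data
            urls = [url for url, _har in by_host[host]]
            toc.append((host, (start, start + member.size), urls))
    return outer_bytes, toc


def read_host(outer_bytes, byte_range):
    """Decode one host's HARs given only that host's byte range of the outer tar."""
    start, end = byte_range
    blob = outer_bytes[start:end]
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:xz") as inner:
        return [json.load(inner.extractfile(m)) for m in inner if m.isfile()]
```

The point of leaving the outer layer uncompressed is that each `(start, end)` range is a complete, independently decodable .tar.xz; a client that can do ranged reads never has to touch the bytes for hosts it doesn't care about.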

---

This "web-spidering-firehose data-lake as-a-Service" architecture would enable pretty much anyone to build whatever arbitrary search index they want downstream of it, containing as much or as little of the web as they want. Each consumer would only need to do as much work as is required to fetch and parse the HARs of the domains they've decided they care about indexing.

This architecture would also be "temporal" (akin to a temporal RDBMS table) — as a consumer of this service, you wouldn't see "the current version" of a scraped URL, but rather all previous attempts to scrape that URL, and what happened each time. (This would mean that no website could ever censor the dataset retroactively by adding a robots.txt "Disallow *" after scrapes have already happened. Their robots.txt config would prevent further scraping, but previous scraping would be retained.)

And in fact, in this architecture, the HTTP interaction to retrieve /robots.txt for a domain would produce a HAR transcript that would get archived like any other. Domains restricted from crawling by robots.txt would still get regular HAR transcripts recorded of the result of checking that their /robots.txt still restricts crawling. (Reducing over these /robots.txt HAR transcripts is how a consumer-indexer would determine whether they should currently be showing/hiding a domain in their built index.)
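A minimal sketch of what that "reducing over /robots.txt transcripts" could look like on the consumer side. It assumes each transcript has already been boiled down to a hypothetical `(timestamp, status, body)` tuple, and the status handling is my own choice, loosely mirroring what Python's `urllib.robotparser` does when it fetches a file itself (401/403 means disallow, other 4xx means allow); anything else, such as a 5xx, just carries the previous state forward.

```python
from urllib.robotparser import RobotFileParser


def reduce_robots(history, user_agent="*", path="/"):
    """Fold a domain's /robots.txt fetch history into a current allow/deny decision.

    history: iterable of (timestamp, http_status, body_text) tuples, in any order.
    """
    allowed = True  # assumption: a domain with no usable record is treated as crawlable
    for _ts, status, body in sorted(history, key=lambda rec: rec[0]):
        if status in (401, 403):
            allowed = False
        elif 400 <= status < 500:
            allowed = True
        elif 200 <= status < 300:
            parser = RobotFileParser()
            parser.parse(body.splitlines())
            allowed = parser.can_fetch(user_agent, path)
        # Any other status (e.g. 5xx) leaves the previous decision in place.
    return allowed
```

Since the full history is retained, a consumer could just as easily evaluate this "as of" some past timestamp by filtering the history before reducing.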

1 comments

Yes, we veto sites to prevent spam.

I'm not sure you would like the results of what you suggest - if you are really going to crawl everything indiscriminately, you will end up with a lot of rubbish. Just check out Common Crawl if you want to get an idea of what it would look like.

> Just check out Common Crawl if you want to get an idea of what it would look like.

It has a lot of rubbish, sure, but the reason that matters with Common Crawl is that Common Crawl isn't a continuous stream. It's a big monthly 100TB incremental deliverable that makes up part of an even larger multi-petabyte whole dataset, where "using the Common Crawl dataset" mostly means relying on one of a few IaaS providers who've grabbed the whole thing and unpacked it into their serverless-data-warehouse cluster that you can run map-reduce jobs against.

A given consumer of this hypothetical web-scraping-results "firehose via a data lake" API, meanwhile, wouldn't need to drink from the entire firehose in order to "follow" live data. For many purposes, they could instead just drink from the much-lower-pressure URLs queue, to discover what has been scraped; and then schedule fetching just those things [or rather, the domain-and-time-bucketed archive-chunks that contain just those things].
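As a sketch of that "drink from the TOC, then fetch only the chunks you need" flow: given TOC entries shaped like the tuples described earlier (archive URL, chunk byte-range, hostname, URL list), a consumer could filter by hostname and collapse what's left into one HTTP `Range` header value per archive. `plan_fetches` is a hypothetical helper, byte ranges are assumed to be half-open `(start, end)` offsets, and whether the host actually honors multi-range requests is something you'd have to check per CDN.

```python
from collections import defaultdict


def plan_fetches(toc_entries, wanted_hosts):
    """Map each archive URL to a Range header value covering only the wanted hosts."""
    spans_by_archive = defaultdict(list)
    for archive_url, (start, end), host, _urls in toc_entries:
        if host in wanted_hosts:
            spans_by_archive[archive_url].append((start, end))

    plan = {}
    for archive_url, spans in spans_by_archive.items():
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching spans become a single range.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        # HTTP byte ranges are inclusive at both ends, hence end - 1.
        plan[archive_url] = "bytes=" + ",".join(f"{s}-{e - 1}" for s, e in merged)
    return plan
```

A consumer that follows only a handful of per-hostname topics would end up issuing a few small ranged GETs per chunk, which is what keeps this workable over the public Internet.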

Which, for many consumers, might end up a low-bandwidth-enough affair that the data could be delivered to them over the regular public Internet, without needing them to "move compute to data."

Consumers might still need a copy of the entire dataset to backfill their indexing system initially — and this might still require doing the "colocate to the IaaS cluster where the dataset is, and run a map-reduce job" thing — but that'd be a one-time bootstrapping process, not a periodic job that needs to be reliable.

(In fact, since it's so rare, the scraping-service provider could even take responsibility for running these jobs themselves, as a sort of single-shot PaaS. "Subscribe to the firehose and we'll help you to do a one-time map-reduce over our dataset to backfill your index, all costs on us. Just define a job using this here SDK and upload it to our dashboard; it'll be queued to run on our infra; and when it's done, you'll get emailed a link to an object-store snapshot of the ephemeral data warehouse the job populated.")

---

Also, to be clear, I wasn't intending to describe an infrastructure whose output is directly usable as the index of a search engine. It'd be quite useless for that, just as Common Crawl would be. Such a dataset still needs curation.

It's just that, as with Common Crawl, the curation step should rightfully be the (direct, B2B) consumer's responsibility, because there are many different use-cases such data can be put to that require different curation strategies:

• general whole-web search engines (obviously)

• site-specific search engines

• vertical-specific search engines (think: Google Scholar; FrogFind)

• format-specific global aggregators (e.g. a PubSubHubbub gateway that pre-discovers RSS feeds; a Matrix server that discovers and suggests other Matrix servers; or that old idea of an "Internet Yellow Pages" built out of people's VCard-RDF-microformat contact data embedded in XHTML — but now extended to the proprietary pseudo-microformats of various "about me" landing-page services)

• "see previous versions" services like the Wayback Machine (taking advantage of the immutability of the historical HARs in the data stream)

• a Shodan-like deep-web "discover what doesn't want to be discovered" service, surfacing websites with Disallow * robots.txt rules

• web analytics (like you can do with Common Crawl, but live, using scalable OLTP methods)

• continuous updating of ML models with "up to date" knowledge of the world (at least, once we figure out how to continuously train ML models)

There's really a lot you can do with what's essentially a periodic high-level packet dump of the result of poking every URL you can find, as often as each site is willing to let you.