> Just check out Common Crawl if you want to get an idea of what it would look like.

It has a lot of rubbish, sure, but the reason that matters with Common Crawl is that Common Crawl isn't a continuous stream. It's a big monthly 100TB incremental deliverable that makes up part of an even larger multi-petabyte dataset, where "using the Common Crawl dataset" mostly means relying on one of a few IaaS providers who've grabbed the whole thing and unpacked it into a serverless data-warehouse cluster that you can run map-reduce jobs against.

A given consumer of this hypothetical web-scraping-results "firehose via a data lake" API, meanwhile, wouldn't need to drink from the entire firehose in order to "follow" live data. For many purposes, they could instead just drink from the much-lower-pressure URLs queue to discover what has been scraped, and then schedule fetches of just those things (or rather, of the domain-and-time-bucketed archive chunks that contain them).

For many consumers, this might be a low-enough-bandwidth affair that the data could be delivered to them over the regular public Internet, without them needing to "move compute to data."
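To make the "follow the URLs queue, then fetch only the matching chunks" idea concrete, here's a rough Python sketch. Everything in it is an assumption for illustration: the `ScrapeEvent` record, the `(domain, hour)` chunk key, and the one-hour bucket size are all made up, since no such API exists to copy from.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ScrapeEvent:
    """One entry from the (hypothetical) low-volume URLs queue."""
    url: str
    scraped_at: datetime  # assumed to be timezone-aware


def chunk_key(event, bucket_hours=1):
    """Map an event to the (domain, time-bucket) chunk assumed to contain it.

    Assumes bucket_hours evenly divides 24, so buckets never span midnight.
    """
    domain = (urlsplit(event.url).hostname or "").lower()
    ts = event.scraped_at.astimezone(timezone.utc)
    bucket_start = ts.replace(
        hour=ts.hour - ts.hour % bucket_hours, minute=0, second=0, microsecond=0
    )
    return domain, bucket_start.strftime("%Y-%m-%dT%H:00Z")


def chunks_to_fetch(events, wanted_domains, bucket_hours=1):
    """Return {chunk_key: [urls]} for only the domains this consumer follows."""
    wanted = {d.lower() for d in wanted_domains}
    plan = defaultdict(list)
    for event in events:
        key = chunk_key(event, bucket_hours)
        if key[0] in wanted:
            plan[key].append(event.url)
    return dict(plan)
```

The consumer only ever downloads the chunks named in the returned plan, so its bandwidth scales with the domains it cares about rather than with the size of the whole firehose.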

Consumers might still need a copy of the entire dataset to backfill their indexing system initially — and this might still require doing the "colocate to the IaaS cluster where the dataset is, and run a map-reduce job" thing — but that'd be a one-time bootstrapping process, not a periodic job that needs to be reliable.

(In fact, since it's so rare, the scraping-service provider could even take responsibility for running these jobs themselves, as a sort of single-shot PaaS. "Subscribe to the firehose and we'll help you to do a one-time map-reduce over our dataset to backfill your index, all costs on us. Just define a job using this here SDK and upload it to our dashboard; it'll be queued to run on our infra; and when it's done, you'll get emailed a link to an object-store snapshot of the ephemeral data warehouse the job populated.")

---

Also, to be clear, I wasn't intending to describe an infrastructure whose output is directly usable as the index of a search engine. It'd be quite useless for that, just as Common Crawl would be. Such a dataset still needs curation.

It's just that, as with Common Crawl, the curation step should rightfully be the (direct, B2B) consumer's responsibility, because there are many different use cases such data can be put to, each requiring a different curation strategy:

• general whole-web search engines (obviously)

• site-specific search engines

• vertical-specific search engines (think: Google Scholar; FrogFind)

• format-specific global aggregators (e.g. a PubSubHubbub gateway that pre-discovers RSS feeds; a Matrix server that discovers and suggests other Matrix servers; or that old idea of an "Internet Yellow Pages" built out of people's VCard-RDF-microformat contact data embedded in XHTML — but now extended to the proprietary pseudo-microformats of various "about me" landing-page services)

• "see previous versions" services like the Wayback Machine (taking advantage of the immutability of the historical HARs in the data stream)

• a Shodan-like deep-web "discover what doesn't want to be discovered" service, surfacing websites whose robots.txt disallows all crawling (User-agent: * with Disallow: /)

• web analytics (like you can do with Common Crawl, but live, using scalable OLAP methods)

• continuous updating of ML models with "up to date" knowledge of the world (at least, once we figure out how to continuously train ML models)

There's really a lot you can do with what's essentially a periodic high-level packet dump of the result of poking every URL you can find, as often as each site is willing to let you.
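As one small example of consumer-side curation, the robots.txt filter behind the Shodan-like service in the list above could look roughly like this. It's a minimal sketch using Python's standard `urllib.robotparser`; the function name `disallows_site_root` is made up, and it assumes the robots.txt body is already present in the scraped data.

```python
from urllib.robotparser import RobotFileParser


def disallows_site_root(robots_txt, user_agent="*"):
    """Return True if these robots.txt rules forbid user_agent from fetching "/".

    Only the root path is checked, so a site that blocks "/" but allows some
    sub-path through an Allow rule is still reported as disallowing.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Usage: a blanket "keep out" file versus one that only hides a subtree.
blanket = "User-agent: *\nDisallow: /\n"
partial = "User-agent: *\nDisallow: /private/\n"
flagged = [disallows_site_root(text) for text in (blanket, partial)]
```

Checking only the root path is a deliberate simplification. A real filter would probably want to look at the whole rule set before flagging a site.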