HN Mirror


by barbazfoo12 5189 days ago

4. Provide a compressed archive of the data the scrapers want and make it available.

No one should have to scrape in the first place.

It's not 1993 anymore. Sites want Google and others to have their data. Turns out that allowing scraping produced something everyone agrees is valuable: a decent search engine. Sites are being designed to be scraped by a search engine bot. This is silly when you think about it. Just give them the data already.

There is too much unnecessary scraping going on. We could save a whole lot of energy by moving more toward a data dump standard.

Plenty of examples to follow. Wikimedia, StackExchange, Public Resource, Amazon's AWS suggestions for free data sources, etc.

2 comments

FuzzyDunlop 5189 days ago

One might argue that indexing from a data-dump will lead to search results that are only as up to date as the last dump.

In StackExchange's case, most of these are now a week or more old.

Maybe it's a good idea, but I'm not sure how many would want to dump their data on a daily basis to keep Google updated, when Google can quite easily crawl their sites as and when it needs to.


barbazfoo12 5189 days ago

Have you considered rsync? Dropbox uses it. So lots of people who don't even know what rsync is are now using it. We could all be using it for much more than just Dropbox. And if you have ever used gzip on HTML you know how well it compresses. The savings are quite substantial. Do you think most browsers are normally requesting compressed HTML?
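
For anyone who wants to check the gzip claim themselves, here is a minimal sketch using Python's standard `gzip` module. The sample markup and the `gzip_ratio` helper are made up for illustration. Real pages will compress differently, and repetitive markup like this sample compresses better than typical content.

```python
import gzip


def gzip_ratio(html: str) -> float:
    """Return compressed size divided by raw size for an HTML string."""
    raw = html.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)


# Hypothetical sample: a repetitive list of links, as on a typical index page.
sample = "<ul>" + "".join(
    f'<li class="item"><a href="/post/{i}">Post {i}</a></li>' for i in range(500)
) + "</ul>"

ratio = gzip_ratio(sample)
print(f"{len(sample.encode('utf-8'))} bytes raw, compressed/raw ratio {ratio:.2f}")
```

On the browser question: a client signals that it accepts compressed responses with the `Accept-Encoding: gzip` request header, and the server confirms with `Content-Encoding: gzip`.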


minikomi 5189 days ago

It could be /data.zip like /robots.txt
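
A rough sketch of how a crawler might use such a convention, assuming a site publishes its dump at `/data.zip`. That path is only the suggestion above, not an existing standard, and the helper names are hypothetical. A conditional GET with `If-Modified-Since` lets the client skip the download when the dump has not changed, which partly addresses the staleness concern raised earlier in the thread.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen

DUMP_PATH = "/data.zip"  # hypothetical well-known path, by analogy with /robots.txt


def dump_url(site: str) -> str:
    """Resolve the dump location against the site's origin."""
    return urljoin(site, DUMP_PATH)


def build_request(site: str, last_modified: Optional[str] = None) -> Request:
    """Build a GET request, made conditional if we saw an earlier Last-Modified value."""
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    return Request(dump_url(site), headers=headers)


def fetch_dump_if_changed(site: str, last_modified: Optional[str] = None) -> Optional[bytes]:
    """Return the archive bytes, or None if the server answers 304 Not Modified."""
    try:
        with urlopen(build_request(site, last_modified)) as resp:
            return resp.read()
    except HTTPError as err:
        if err.code == 304:  # dump unchanged since last fetch
            return None
        raise
```

A crawler would store the `Last-Modified` value from each successful response and pass it back on the next run, e.g. `fetch_dump_if_changed("https://example.com", "Sat, 01 Jan 2011 00:00:00 GMT")`.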
