by barbazfoo12
5143 days ago
4. Provide a compressed archive of the data the scrapers want and make it available. No one should have to scrape in the first place. It's not 1993 anymore. Sites want Google and others to have their data. It turns out that allowing scraping produced something everyone agrees is valuable: a decent search engine. Sites are now being designed to be scraped by a search engine bot, which is silly when you think about it. Just give them the data already. There is too much unnecessary scraping going on. We could save a whole lot of energy by moving toward a data dump standard. There are plenty of examples to follow: Wikimedia, StackExchange, Public Resource, Amazon's AWS suggestions for free data sources, etc.

In StackExchange's case, most of the dumps are now a week or more old.
Maybe it's a good idea, but I'm not sure how many would want to dump their data on a daily basis to keep Google updated, when Google can quite easily crawl their sites as and when it needs to.