Hacker News
by 300bps 4490 days ago
I wonder how Google chooses which Wikipedia articles they scrape and which ones they don't.

In testing, they definitely don't seem to scrape every article:

http://i.imgur.com/ujDqZhB.png

1 comment

This is a good question. I've long surmised that Google keeps a set of heuristics for every site whose API allows easy domain-specific ranking. With Wikipedia, you have the number of page edits, the frequency of page edits, and (to an extent) the quality of recent page edits. StackOverflow provides an even easier metric for what's considered high quality, and Google appears to apply its own layer on top of that. In my non-scientific perception, looking something up through Google is almost always more fruitful on the first search than going directly to SO.
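
To make the Wikipedia signals concrete, here is a toy sketch of how edit count, edit frequency, and recent edit quality could be folded into one number. It is a guess at the general shape of such a heuristic, not anything Google is known to do. `PageStats`, `edit_activity_score`, the log scaling, and the use of a reverted-edit share as a stand-in for "quality" are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-article signals; field names are illustrative only."""
    total_edits: int           # lifetime number of edits to the page
    edits_last_30_days: int    # recent edit frequency
    reverted_fraction: float   # share of recent edits that were reverted (0.0 to 1.0)


def edit_activity_score(stats: PageStats) -> float:
    """Combine the three signals into one toy score (higher = more active, cleaner page)."""
    # Log scaling so a page with 10x the edits does not get 10x the score.
    volume = math.log1p(stats.total_edits)
    frequency = math.log1p(stats.edits_last_30_days)
    # Assumed quality proxy: fewer reverted edits means better recent edits.
    quality = 1.0 - min(max(stats.reverted_fraction, 0.0), 1.0)
    return (volume + frequency) * quality


busy = PageStats(total_edits=5000, edits_last_30_days=120, reverted_fraction=0.05)
quiet = PageStats(total_edits=12, edits_last_30_days=0, reverted_fraction=0.0)
print(edit_activity_score(busy) > edit_activity_score(quiet))
```

A real ranking layer would presumably weigh many more signals; the point is only that these three are cheap to compute from a page's edit history.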