by jka
1443 days ago
Roughly speaking, yep - Common Crawl provides a sizable chunk of web data (420 TiB uncompressed, over 3 billion unique URLs, as of May 2022; historic statistics here[1]), and is updated on a monthly basis. That is not near-real-time, true, but it is relatively fresh. A question to ask could be: how often do users care about information from a few minutes ago, compared to information that has been available for longer?

[1] - https://commoncrawl.github.io/cc-crawl-statistics/

- a few thousand news sites (like nyt.com, bbc.co.uk),
- a few thousand very popular blogs (based on what influencers people search for),
- a handful of social media sites (e.g. Twitter),
- a few hundred databases in areas like weather, airlines, sports (like ATP for people who look for Wimbledon results today)?