by jka 1443 days ago

Roughly speaking, yep - Common Crawl provides a sizable chunk of web data (420 TiB uncompressed, over 3 billion unique URLs, as of May 2022; historic statistics here[1]), and is updated on a monthly basis. Not near-real-time, true, but relatively fresh.

A question to ask could be: how often do users care about information from a few minutes ago, compared to information that has been available for a longer duration of time?

[1] - https://commoncrawl.github.io/cc-crawl-statistics/

2 comments

flexie 1443 days ago

Isn't that more a question of adding to the mix frequent scraping of

- a few thousand news-sites (like nyt.com, bbc.co.uk),

- a few thousand very popular blogs (based on what influencers people search for),

- a handful of social media sites (e.g. Twitter),

- a few hundred databases in areas like weather, airlines, sports (like ATP for people who look for Wimbledon results today)?
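A minimal sketch of how such a mix could be scheduled, assuming each source is tagged with a tier and each tier has its own re-crawl interval. The tier names, the interval values, and the `CrawlJob`, `build_queue` and `pop_due` helpers are all hypothetical, chosen only to illustrate the idea:

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical re-crawl intervals per tier, in seconds (illustrative values only).
RECRAWL_INTERVAL_S = {
    "news": 5 * 60,
    "blog": 60 * 60,
    "social": 60,
    "database": 15 * 60,
}


@dataclass(order=True)
class CrawlJob:
    due_at: float                     # only this field is used for heap ordering
    url: str = field(compare=False)
    tier: str = field(compare=False)


def build_queue(sources, now=0.0):
    """Build a min-heap of jobs from (url, tier) pairs, all due immediately."""
    queue = [CrawlJob(now, url, tier) for url, tier in sources]
    heapq.heapify(queue)
    return queue


def pop_due(queue, now):
    """Return the jobs due at `now` and reschedule each by its tier's interval."""
    due = []
    while queue and queue[0].due_at <= now:
        due.append(heapq.heappop(queue))
    for job in due:
        next_due = now + RECRAWL_INTERVAL_S[job.tier]
        heapq.heappush(queue, CrawlJob(next_due, job.url, job.tier))
    return due
```

Calling `pop_due(queue, now)` on a timer would hand back whichever sources are due for a fresh fetch, while the bulk of the index could still come from a monthly crawl.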


sudodude 1443 days ago

I mean, any time someone wants information on current or recent events is your use case right there. If you exclude news entirely, you could maybe disregard recent websites, but I imagine news is statistically a pretty large portion of search.
