by jake-low 2317 days ago

This is phenomenal. I've spent the past week scraping data from primary sources for just Washington state [0], in order to make this chart, which I hacked together last weekend [1].

[0]: https://github.com/jake-low/covid-19-wa-data

[1]: https://observablehq.com/@jake-low/covid-19-in-washington-st...

Doing this for just one state was a pretty substantial effort. I imagine there are multiple people at the Times who are spending several hours a day reviewing and cleaning scraped data (it seems that every couple of days some formatting change breaks your scripts, or a source publishes data that later needs to be retracted).

The Times dataset appears to contain per-county case and death observations in a time series, going all the way back to the first confirmed U.S. case in January in Snohomish County, WA. This makes it by far the most comprehensive time series dataset of U.S. COVID-19 cases publicly available.
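
For anyone wondering what a per-county time series buys you, here is a minimal Python sketch. It assumes the counts are cumulative and that the columns are named `date`, `county`, `state`, `cases`, and `deaths`; both are assumptions for illustration, and the sample rows are made up rather than taken from the real dataset. The function name `daily_new_cases` is hypothetical too.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. The column names and the numbers are assumptions
# for illustration only, not values from the real dataset.
SAMPLE = """date,county,state,cases,deaths
2020-03-01,Snohomish,Washington,2,0
2020-03-02,Snohomish,Washington,5,1
2020-03-01,King,Washington,10,1
2020-03-02,King,Washington,14,2
"""

def daily_new_cases(csv_text):
    """Turn cumulative per-county case counts into per-day increases."""
    series = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["state"], row["county"])
        series[key].append((row["date"], int(row["cases"])))

    result = {}
    for key, observations in series.items():
        observations.sort()  # ISO dates sort correctly as plain strings
        previous = 0
        increases = []
        for date, total in observations:
            increases.append((date, total - previous))
            previous = total
        result[key] = increases
    return result

if __name__ == "__main__":
    for county, increases in daily_new_cases(SAMPLE).items():
        print(county, increases)
```

Because every row carries its own date and county, one pass over a single file is enough; there is no need to stitch together separate daily tables first.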

Some people in this thread linked to the Johns Hopkins CSSE dataset. I've looked at that data, but it doesn't go back very far in time for the U.S., and the tables are published as daily summaries with differing schemas, which makes them hard to use out of the box. For some days earlier in March, "sublocations" aren't even structured: the same column contains both "Boston, MA" and "Los Angeles County", which makes it very hard to use. No disrespect to the team behind the JHU dataset; it attempts to cover the whole world since the outbreak began, which is an incredible and difficult goal. But for mapping and studying the outbreak in the U.S., the Times dataset will likely be the best choice right now.
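
To make the mixed-column problem concrete, here is a rough Python sketch of the kind of guessing you end up doing. The two sample strings are the ones quoted above; the function name `classify_sublocation`, the regex, and the returned labels are all hypothetical, and real data would need many more cases than this.

```python
import re

# Hypothetical pattern: "Some Place, ST" where ST is a two-letter
# state abbreviation. This is a guess at one format, not a full parser.
PLACE_WITH_STATE = re.compile(r"^(?P<place>.+),\s*(?P<state>[A-Z]{2})$")

def classify_sublocation(raw):
    """Guess what kind of place a free-text sublocation string names."""
    text = raw.strip()
    match = PLACE_WITH_STATE.match(text)
    if match:
        return {
            "kind": "place_with_state",
            "name": match.group("place"),
            "state": match.group("state"),
        }
    if text.endswith(" County"):
        # No state is given, so it cannot be recovered from the string alone.
        return {"kind": "county", "name": text[: -len(" County")], "state": None}
    return {"kind": "unknown", "name": text, "state": None}

if __name__ == "__main__":
    for value in ["Boston, MA", "Los Angeles County"]:
        print(value, "->", classify_sublocation(value))
```

Even when the guess works, a bare county name comes back without a state, so rows like that still can't be joined against anything without extra lookup tables.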

Huge kudos to the New York Times team for making this data freely available.