| I happen to scrape a lot of large websites (mostly forums currently) and that's messy enough to force you into learning tricks. I didn't stumble upon any (tabular, at least) dataset that wasn't very curated. Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now. But I remember taking courses at my uni, and later on, with R and Python. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow? Imagine yourself pulling data out of a company ERP, with human-filled data. It won't be a walk in the park where you just make some logit models and call it a day. You'll spend a lot of time trying to understand what's going on. And only then do you run the models or make a dashboard. |
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
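To make that concrete, here's a rough sketch of such a pipeline using only the Python standard library. Everything in it is made up for illustration: the two inline HTML snippets stand in for pages you'd actually fetch, the class names `post` and `msg` are hypothetical, and `sqlite3` is just a stand-in for whatever store you prefer (DuckDB would fit the same shape).

```python
import sqlite3
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PostParser(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # > 0 while we are inside a matching element
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested tag inside a post
        elif self.target_class in classes:
            self.depth = 1           # a new post starts here
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.posts[-1] += data


def extract_posts(html, css_class):
    """Return the whitespace-normalised text of each matching element."""
    parser = PostParser(css_class)
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.posts]


def store(conn, site, posts):
    """Append one site's posts to a shared table so they can be queried together."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (site TEXT, body TEXT)")
    conn.executemany("INSERT INTO posts VALUES (?, ?)", [(site, p) for p in posts])


# Hypothetical pages from two forums that mark up posts differently.
FORUM_A = '<div class="post"><b>Hello</b> from forum A</div><div class="post">Second post</div>'
FORUM_B = '<ul><li class="msg body">Hello from forum B</li></ul>'

conn = sqlite3.connect(":memory:")
store(conn, "forum_a", extract_posts(FORUM_A, "post"))
store(conn, "forum_b", extract_posts(FORUM_B, "msg"))

# Once everything is in one table, cross-site questions become plain SQL.
counts = dict(conn.execute("SELECT site, COUNT(*) FROM posts GROUP BY site"))
print(counts)
```

The per-site class name is the part that breaks when a site changes its structure, so keeping it as a parameter (rather than baking it into the parser) makes those changes cheaper to absorb.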