

by iagovar 1990 days ago

I happen to scrape a lot of large websites (mostly forums currently) and that's messy enough to force you into learning tricks.

I never stumbled upon any dataset (tabular, at least) that wasn't heavily curated.

Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of other stuff I'm not even aware of now.

But I remember taking courses at my uni, and later on, with R and Python. They were interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?

Imagine yourself pulling data out of a company ERP, full of human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.

1 comments

giu 1990 days ago

Thanks a lot for your reply!

Scraping websites can be quite the messy business, since some websites change their document structure more often than others.

Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).


iagovar 1990 days ago

Yeah, when the data piled up I had to think about how to store it, about RAM, and about a bunch of other things that I didn't have to consider with sample data. RAM specifically, and how to transform data without needing so much of it, was a concern for some time.
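
To illustrate the RAM point, here is a minimal sketch of one way to transform data without loading it all at once: read rows lazily and keep only a running aggregate. The column names (`author`, `post`) and the in-memory sample are made up for the example; a real scrape would read from a file on disk instead.

```python
import csv
import io
from collections import Counter


def count_posts_per_author(rows):
    """Tally posts per author from any iterable of row dicts, one row at a time."""
    counts = Counter()
    for row in rows:
        counts[row["author"]] += 1
    return counts


# Hypothetical sample standing in for a large scraped CSV file.
sample = io.StringIO("author,post\nana,hello\nbob,hi\nana,bye\n")

# csv.DictReader yields rows lazily, so memory use depends on the number
# of distinct authors, not on the number of rows.
counts = count_posts_per_author(csv.DictReader(sample))
print(counts.most_common())  # [('ana', 2), ('bob', 1)]
```

The same pattern works with `open("posts.csv", newline="")` in place of the `StringIO` object.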


rohan_shah 1990 days ago

I am also currently learning to scrape forums. And I am a philosophy student. Could you point to some resources that helped you learn it better?


jmt_ 1990 days ago

Learning CSS selectors and HTML structure, inspect element and the other dev tools built into your browser, and something like BeautifulSoup (for static/non-JS-heavy pages) and Selenium (for JS-heavy and other complicated pages) is pretty key imo. My background in web dev helped me with the HTML stuff. Basically, you fire up the page in a browser and use inspect element to work out which CSS selectors uniquely identify the data you want. Then using BeautifulSoup or Selenium to parse and interact with the DOM will cover most web scraping cases.
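
A rough sketch of that workflow on a static page, using only Python's standard-library `html.parser` as a stand-in so it runs without installing anything. The markup and the `post-body` class are invented for illustration; with BeautifulSoup the equivalent lookup would be a CSS selector such as `soup.select(".post-body")`.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1              # entering a new match
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Hypothetical forum markup, as you might see it in inspect element.
page = """
<div class="post"><span class="author">ana</span>
  <div class="post-body">First <b>post</b></div></div>
<div class="post"><span class="author">bob</span>
  <div class="post-body">Second post</div></div>
"""

extractor = ClassTextExtractor("post-body")
extractor.feed(page)
print(extractor.results)  # ['First post', 'Second post']
```

This only handles well-formed static HTML; pages that build their content with JavaScript are where a browser-driving tool like Selenium comes in.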


iagovar 1990 days ago

Are you looking for something specific? Most tools have documentation you can bang your head against.
