| HN Mirror


by anigbrowl 2222 days ago

You need to be more specific about which part you're having a problem with and what your goal is. Do you want to build a scraper you can sell or give away, accumulate social media data for commercial purposes, or pursue some research goal?

There's no generic solution, since every platform is different, and there's no one scraping library (or approach) to rule them all. Most efforts I've seen use BeautifulSoup to parse web pages and/or Selenium to automate browser actions, but I'm sure there are better alternatives. It is a frustrating space to work in, as most tools are limited and the methods are jealously guarded, much as most social media companies jealously guard the data they harvest.
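To make the BeautifulSoup half of that concrete, here is a minimal sketch of the parsing step only. The HTML snippet, the class names (`post`, `author`, `body`), and the `extract_posts` helper are all made up for illustration; no real platform's markup looks like this, and fetching the page (or driving a browser with Selenium) is left out entirely.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page you have already fetched.
# Real platforms use very different (and much messier) structures.
SAMPLE_HTML = """
<div class="post"><span class="author">alice</span><p class="body">First post</p></div>
<div class="post"><span class="author">bob</span><p class="body">Second post</p></div>
<div class="post"><p class="body">Post with no author element</p></div>
"""


def extract_posts(html):
    """Return a list of {"author", "body"} dicts from the assumed markup."""
    # "html.parser" is the parser bundled with Python, so no extra dependency.
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.find_all("div", class_="post"):
        author = node.find("span", class_="author")
        body = node.find("p", class_="body")
        # Skip entries that don't match the expected shape instead of crashing.
        if author is None or body is None:
            continue
        posts.append({
            "author": author.get_text(strip=True),
            "body": body.get_text(strip=True),
        })
    return posts


posts = extract_posts(SAMPLE_HTML)
for post in posts:
    print(post["author"], "-", post["body"])
```

The selectors are the part you end up rewriting for every site, which is why nothing here generalizes across platforms.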

You could probably learn a lot by leveraging existing tools and seeing what you can do on the analysis side. Twitter has a fairly well-specified API, and if you are getting frustrated with the limits of that, there's twint. Facebook is the biggest 'pile' of data, but they know it, and when you look at the source of a FB page you can see there's a lot of stuff that messes up your ability to parse that data, accidentally or deliberately. You might be better off providing tooling for small but growing social media platforms, which are less valuable/profitable to scrape but also don't have the accumulated digital sediment that makes the big ones so difficult to parse.