| HN Mirror


by deathemperor 3136 days ago

I've just finished my research on web scraping for my company (it took me about 7 days). I started with import.io and scrapinghub.com for point-and-click scraping to see if I could do it without writing code. Ultimately, point-and-click scraping is for non-technical users, and there is a lot of data that is hard to scrape that way. For example, lazada.com.my stores the product's SKU inside an attribute that looks like <div data-sku-simple="SKU11111"></div>, which I couldn't get. import.io's pricing is also a problem: $999 a month for API access to the data is just too high.

So I decided to use scrapy, the core of scrapinghub.com.

I haven't written much Python before, but scrapy was very easy to learn. I wrote 2 spiders and ran them on scrapinghub (their serverless cloud). Scrapinghub supports job scheduling and many other things at a cost. I prefer scrapinghub because we don't have DevOps on my team. It also supports Crawlera to prevent IP banning, Portia for point-and-click scraping (still in beta and still hard to use), and Splash for SPA websites, but Splash is buggy and its GitHub repo is not under active maintenance.

For DOM queries I use BeautifulSoup4. I love it. It's jQuery for Python.
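To give an idea of the jQuery feel, here is a minimal sketch of pulling the data-sku-simple attribute out with BeautifulSoup4. The markup and the extract_skus helper are made up for illustration (only the attribute name comes from the Lazada example), so treat it as a starting point rather than what the real page needs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the <div data-sku-simple="..."> example;
# the class names and product names are invented for this sketch.
HTML = """
<div class="product" data-sku-simple="SKU11111"><span class="name">Phone case</span></div>
<div class="product" data-sku-simple="SKU22222"><span class="name">Charger</span></div>
<div class="product"><span class="name">No SKU here</span></div>
"""


def extract_skus(html):
    """Return the data-sku-simple value of every element that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS attribute selector, much like $("[data-sku-simple]") in jQuery.
    return [tag["data-sku-simple"] for tag in soup.select("[data-sku-simple]")]


skus = extract_skus(HTML)
print(skus)
```

Inside a scrapy spider the same helper could be fed response.text, assuming the attribute is present in the HTML the server returns and not added later by JavaScript.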

For SPA websites I wrote a scrapy middleware that uses puppeteer. Puppeteer is deployed on AWS Lambda (1M free requests for the first 365 days, more than enough for scraping) using this: https://github.com/sambaiz/puppeteer-lambda-starter-kit

I am planning to use Amazon RDS to store scraped data.