Hacker News
by rprameshwor 3307 days ago
I'm interested to know more about the tools/techniques you used to scrape the sites and clean the results. I am working on scraping content from a couple of sites and it seems to be a pain, probably because I don't have a lot of experience with it.

If you are familiar with Python, try Scrapy[1]. You can also scrape websites using beautifulsoup4[2].

[1] https://scrapy.org/
[2] https://pypi.python.org/pypi/beautifulsoup4

We actually used Node.js's request module, combined with some NLP (using the natural package), to pick out the main content. This worked pretty well, and for our purposes it didn't need to be perfect: anything like headers was removed anyway when we processed the content, since they aren't full sentences.
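
To make the "keep only full sentences" idea concrete, here is a rough sketch in Python using only the standard library. It is not our Node.js code and does not use request or natural; the names (BlockTextExtractor, looks_like_sentence, main_content), the tag lists, and the five-word minimum are all made up for illustration, and the sentence check is a crude regex rather than real NLP. Fetching the page is left out, so it works on an HTML string you already have.

```python
import re
from html.parser import HTMLParser

# Tags whose text is never readable content.
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (an assumed, non-exhaustive list).
BLOCK_TAGS = {"p", "div", "li", "title", "nav", "header", "footer",
              "section", "article", "br", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collects the visible text of a page as a list of text blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


# Heuristic: starts with a capital, digit, or quote and ends with
# sentence punctuation (optionally followed by a closing quote or paren).
SENTENCE_RE = re.compile(r"^[A-Z0-9\"'].*[.!?][\"')]?$")


def looks_like_sentence(text, min_words=5):
    """Crude full-sentence test; min_words is an arbitrary threshold."""
    return len(text.split()) >= min_words and bool(SENTENCE_RE.match(text))


def main_content(html):
    """Return only the text blocks that look like full sentences."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if looks_like_sentence(block)]
```

Calling main_content(page_html) on a typical page drops things like the title, nav links, and headings, because they are short or lack end punctuation, and keeps paragraph text. It will also drop legitimate short sentences, which was an acceptable trade-off for us but may not be for you.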