by wasi0013 3314 days ago
If you are familiar with Python, try Scrapy[1]. You can also scrape websites using beautifulsoup4[2].

[1] https://scrapy.org/ [2] https://pypi.python.org/pypi/beautifulsoup4

1 comment

We actually used Node.js's request module, combined with some NLP (using the natural library), to pick out the main content. This worked pretty well. For our purposes it didn't need to be perfect, because anything like headers would be removed when we processed the content (they aren't full sentences).
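
For anyone who wants to try the "drop whatever isn't a full sentence" idea, here is a rough sketch in Python, since that is what the parent comment suggested. It is not our actual code: we used Node.js with request and natural, and this swaps in the standard library's html.parser plus a crude heuristic (ends in sentence punctuation, has at least a few words). The names TextChunks, looks_like_sentence, main_content and the min_words threshold are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags that end the current chunk of text (a rough, hypothetical list).
BLOCK_TAGS = {"p", "div", "li", "td", "br", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class TextChunks(HTMLParser):
    """Collect visible text, split into chunks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def looks_like_sentence(chunk, min_words=5):
    """Crude stand-in for real NLP: enough words and sentence-final punctuation."""
    ends_ok = chunk.rstrip("\"')").endswith((".", "!", "?"))
    return ends_ok and len(chunk.split()) >= min_words


def main_content(html):
    """Return the text chunks of an HTML string that look like full sentences."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if looks_like_sentence(c)]
```

Calling main_content(html) on a fetched page gives back a list of strings. Headings, menu items and footer lines tend to fall out because they are short or lack final punctuation, which is the same "good enough" trade-off described above. A real tokenizer would handle abbreviations and multi-sentence paragraphs better.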