We actually used Node.js's request module to fetch pages, combined with some NLP (using the natural library) to pick out the main content. This worked pretty well, and for our purposes it didn't need to be perfect: anything like headers would be removed anyway when we processed the content, since they aren't full sentences.
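To give a rough idea of the "drop anything that isn't a full sentence" step, here is a minimal sketch in plain Node.js. It is not our actual code and it does not use natural's API: `extractSentences` and `looksLikeSentence` are hypothetical helpers, the regex-based tag stripping is a crude stand-in for real HTML parsing, and the "at least four words, starts with a capital, ends with punctuation" rule is just an assumed heuristic.

```javascript
// Hypothetical heuristic: a line counts as a "full sentence" if it has
// at least four words, starts with a capital letter (or an opening quote
// or bracket), and ends with sentence punctuation.
function looksLikeSentence(line) {
  const words = line.split(' ').filter(Boolean);
  return (
    words.length >= 4 &&
    /^[A-Z"'(]/.test(line) &&
    /[.!?]["')]?$/.test(line)
  );
}

// Hypothetical helper: reduce an HTML string to the text blocks that
// look like full sentences. Headers, menu items and similar fragments
// fall out because they fail the heuristic above.
function extractSentences(html) {
  const text = html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, ' ')
    // Turn the end of common block elements into line breaks.
    .replace(/<\/(p|div|h[1-6]|li|tr|section|article)>|<br\s*\/?>/gi, '\n')
    // Remove whatever tags are left.
    .replace(/<[^>]+>/g, ' ');

  return text
    .split('\n')
    .map((line) => line.replace(/\s+/g, ' ').trim())
    .filter(looksLikeSentence);
}
```

You would call `extractSentences(body)` on the response body returned by whatever HTTP client you use. A proper sentence tokenizer will handle abbreviations and multi-sentence paragraphs far better than this line-based check, so treat it only as an illustration of why imperfect content extraction was good enough for us.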