Hacker News
by gus_massa 3990 days ago
Some blogs have a standard closing paragraph, like "If you have read all of this, you may like to subscribe to my RSS", or "We are always hiring at ABC, send your resume." Another problem is short captions that look like paragraphs to the HTML parser, like "Advertisement" or "XYZ Benchmark (higher is better)". One possible solution is to skip paragraphs that have fewer than, say, 150 letters.
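A minimal sketch of that length filter, assuming the paragraphs have already been extracted as plain strings. The function name `keep_long_paragraphs` is made up for illustration, the 150 threshold is only the tentative number from above, and the sketch counts all characters after trimming whitespace rather than strictly letters.

```python
# Tentative cutoff taken from the comment; tune it for your own corpus.
MIN_CHARS = 150


def keep_long_paragraphs(paragraphs, min_chars=MIN_CHARS):
    """Return the paragraphs whose trimmed length is at least min_chars.

    Assumption: each paragraph is already a plain-text string
    (HTML tags stripped by whatever parser was used upstream).
    """
    return [p for p in paragraphs if len(p.strip()) >= min_chars]


if __name__ == "__main__":
    sample = [
        "Advertisement",
        "XYZ Benchmark (higher is better)",
        "We are always hiring at ABC, send your resume.",
        "A real body paragraph. " * 10,
    ]
    for paragraph in keep_long_paragraphs(sample):
        print(paragraph)
```

Note that a long boilerplate footer would still pass this filter, so it mostly helps with captions and one-line sign-offs.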
1 comment

I agree that it is quite reasonable to ignore paragraphs that are shorter than three sentences.
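A rough sketch of the sentence-count variant, with the caveat that the sentence splitter here is deliberately naive (it splits after `.`, `!` or `?` followed by whitespace, so abbreviations like "e.g." will be overcounted). The names `count_sentences` and `keep_multi_sentence` are hypothetical, not from any library.

```python
import re

# Naive sentence boundary: whitespace preceded by ., ! or ?
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_sentences(text):
    """Approximate the number of sentences in text."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return len([part for part in parts if part])


def keep_multi_sentence(paragraphs, min_sentences=3):
    """Return the paragraphs with at least min_sentences sentences."""
    return [p for p in paragraphs if count_sentences(p) >= min_sentences]


if __name__ == "__main__":
    sample = [
        "We are always hiring at ABC, send your resume.",
        "First point. Second point! Is there a third?",
    ]
    print(keep_multi_sentence(sample))
```

A proper sentence tokenizer would be more reliable, but for weeding out captions and one-line sign-offs this kind of approximation may be enough.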