


	by ziziman 810 days ago
	To scrape the websites, do you just blindly cut all of the HTML into fixed-size chunks, or is there some more sophisticated logic to extract the text of interest? I'm wondering because most news websites now have a lot of polluting elements like popups. Would those also go into the database?

1 comment

totolouis 810 days ago

If you look at the vector handler in his code, he is using the bluemonday sanitizer and doing some "replaceAll" calls.

So I think there may be some useless data in the vectors, but that may not be an issue since the answer comes from multiple sources (for simple questions at least).
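
To make that concrete, here is a rough sketch of the "sanitize, replaceAll, then chunk" idea. It is not the repo's actual handler: `stripTags` is a naive standard-library stand-in for a real sanitizer such as bluemonday, and the function names, the list of replaced strings, and the word-based chunking are all assumptions for illustration.

```go
package main

import "strings"

// stripTags drops everything between '<' and '>'.
// Hypothetical, naive stand-in for a real HTML sanitizer: it removes the
// tags themselves but keeps all text inside them, so popup or banner text
// still survives, which is the "useless data" mentioned above.
func stripTags(html string) string {
	var b strings.Builder
	inTag := false
	for _, r := range html {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
			b.WriteRune(' ') // keep words from adjacent elements apart
		case !inTag:
			b.WriteRune(r)
		}
	}
	return b.String()
}

// cleanText strips tags, replaces a few leftovers with spaces, and
// collapses runs of whitespace. The replacement list is an assumption.
func cleanText(html string) string {
	text := stripTags(html)
	for _, old := range []string{"\n", "\t", "&nbsp;"} {
		text = strings.ReplaceAll(text, old, " ")
	}
	return strings.Join(strings.Fields(text), " ")
}

// chunkWords splits text into chunks of at most size words each.
// It returns nil when size is less than 1.
func chunkWords(text string, size int) []string {
	if size < 1 {
		return nil
	}
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += size {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
	}
	return chunks
}
```

Calling `chunkWords(cleanText(page), 200)` would give text chunks ready for embedding. Anything smarter, such as dropping nav bars or cookie popups, would need a content-extraction step before this.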
