Hacker News new | ask | show | jobs
by ziziman 810 days ago
To scrape the websites, do you just blindly cut all of the HTML into defined size chunks or is there some more sophisticated logic to extract text of interest ?

I'm wondering because most news websites now have a lot of polluting elements like popups, would they also go into the database ?

1 comments

If you look at the vector handler in his code, he is using blue Monday sanitizer and doing some "replaceAll".

So I think there may be some useless data in the vector, but that may not be a issue since it is coming from multiple sources (for simple question at least)