by ziziman
810 days ago
To scrape the websites, do you just blindly cut all of the HTML into fixed-size chunks, or is there some more sophisticated logic to extract the text of interest? I'm wondering because most news websites now have a lot of polluting elements like popups; would those also go into the database?
So I think there may be some useless data in the vector, but that may not be an issue since it is coming from multiple sources (for simple questions at least).
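
For anyone curious what the "more sophisticated" option could look like, here is a minimal sketch. It is not the approach used by the project discussed here. It assumes that dropping a few tags (`script`, `style`, `nav`, and so on) is enough to remove most noise, and the names `SKIP_TAGS`, `extract_text`, and `chunk_words` are made up for illustration. Popups built from plain `div` elements would still get through, so some useless text would likely remain.

```python
from html.parser import HTMLParser

# Tags whose contents are assumed to be noise. This list is a guess,
# not a rule; real pages often need per-site tuning.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html):
    """Return the page text with the skipped tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text, size=200):
    """Split text into chunks of at most `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


if __name__ == "__main__":
    page = (
        "<html><body><nav>Menu</nav><script>var x = 1;</script>"
        "<p>Hello world</p><aside>Subscribe now</aside>"
        "<p>More news</p></body></html>"
    )
    for chunk in chunk_words(extract_text(page), size=3):
        print(chunk)
```

Chunking by words after extraction, rather than cutting raw HTML by characters, keeps markup out of the chunks and avoids splitting in the middle of a tag.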