by stekern
2209 days ago
Cool project! I've recently created something similar for personal use. I have many websites (mainly webshops) I want to be notified about changes on, but they don't have RSS feeds, subscriptions or APIs that you can use. I set up a cron job that runs daily, scrapes websites according to some XPaths, and saves the results to a DB. If any new elements have appeared, an email is sent out. The biggest challenge is handling false positives: being able to distinguish between a new element and, e.g., a previously seen element with an updated title, description, etc. For websites that directly expose what seem to be unique, server-side identifiers in their HTML, using those as a primary key seems to work well. If that's not available, the href of the HTML element seems to be fairly static. Do you have any thoughts on the issue of false positives and unique identifiers?
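To make the keying idea above concrete, here is a minimal sketch in Python. It is not the commenter's actual code: the `items` table, the `server_id`, `href` and `title` field names, and both function names are assumptions made up for illustration. It prefers a server-side identifier when the page exposes one and falls back to the resolved href (fragment removed, query string kept, since product IDs often live there).

```python
import sqlite3
from urllib.parse import urldefrag, urljoin


def item_key(page_url, server_id=None, href=None):
    """Pick a stable key: a server-side ID if exposed, else the resolved href."""
    if server_id:
        return f"id:{server_id}"
    if href:
        # Resolve relative links against the page URL and drop the #fragment.
        return "href:" + urldefrag(urljoin(page_url, href)).url
    return None


def record_new_items(conn, page_url, scraped):
    """Store unseen items and return them; known keys only get a title refresh."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, title TEXT)"
    )
    new_items = []
    for item in scraped:
        key = item_key(page_url, item.get("server_id"), item.get("href"))
        if key is None:
            continue  # nothing stable to key on; skip rather than guess
        cur = conn.execute(
            "INSERT OR IGNORE INTO items (key, title) VALUES (?, ?)",
            (key, item["title"]),
        )
        if cur.rowcount == 1:
            new_items.append(item)  # key was not in the table before
        else:
            # Same key, possibly edited title: update silently, no notification.
            conn.execute(
                "UPDATE items SET title = ? WHERE key = ?", (item["title"], key)
            )
    conn.commit()
    return new_items
```

With this shape, an item whose title or description changes between runs keeps its key and is not reported again; only a key that has never been stored triggers a notification.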
Generally though, I'm hoping users understand that feeds produced in this way could be a little more brittle than if the site offered its own feed.
One difference with your approach is that you have the data from previous fetches in your database. With Feed Creator everything related to producing the feed (source URL, selectors, filters, etc.) is embedded in the feed URL to avoid having to record data on the server. So each request is treated as if it's the first one - the server doesn't know if an item in the feed is new or old. If we referred to feed data from previous fetches, maybe we could let users introduce a delay before having a new item added to the feed. This might help in cases where a typo is spotted and corrected by the publisher minutes after publication. Can't think of a much better way of avoiding false positives at the moment though.
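The delay idea could look roughly like the sketch below. This is not how Feed Creator works today (it keeps no state between requests); it assumes some store of first-seen timestamps exists, represented here by a plain dict. The function name, the 30-minute default and the item shape are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def items_ready_for_feed(first_seen, current, now, delay=timedelta(minutes=30)):
    """Return items whose key has been known for at least `delay`.

    first_seen: dict mapping item key -> datetime first observed (mutated in place).
    current: dict mapping item key -> item as scraped in this fetch.
    """
    ready = []
    for key, item in current.items():
        # Remember when a key first appeared; later fetches reuse that timestamp.
        seen_at = first_seen.setdefault(key, now)
        if now - seen_at >= delay:
            # Emit the latest scraped version, so early corrections are included.
            ready.append(item)
    return ready
```

An item first seen at 12:00 would stay out of the feed until a fetch at 12:30 or later, and it would then appear with whatever title the page shows at that point.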