

by chris_st 2257 days ago

Wow, what a great idea, and largely fantastic implementation! It really looks good.

Do RSS feeds have "just" text that you pull to get the article content, or are you parsing the webpage somehow? If the latter, how?

I've done something slightly (well, about 1% :-) similar for the "Popular" page on Pinboard[0]. It used to show a line or two from the start of each article, using a webpage content extractor, but that service got turned off, so I can't use it anymore.

One arguably nice thing about mine is that it's updated once a day, and it remembers what you read yesterday, so new articles are marked, and you can flip a switch to see the new stuff first.

Thanks!

[0] https://pbpb.cls.cloud
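
The "mark what's new since yesterday, optionally show it first" behaviour described above could be sketched roughly as follows. This is only an illustration, not the code behind the linked site: the function names `mark_new` and `order`, and the idea of keying articles by URL, are assumptions made for the example.

```python
def mark_new(today, seen_yesterday):
    """Tag each of today's URLs as new if it was not in yesterday's set."""
    return [{"url": url, "new": url not in seen_yesterday} for url in today]


def order(articles, new_first=False):
    """Return the articles, optionally with the new ones moved to the front."""
    if not new_first:
        return list(articles)
    # sorted() is stable, so articles keep their original relative order
    # within the "new" group and within the "already seen" group.
    return sorted(articles, key=lambda a: not a["new"])


if __name__ == "__main__":
    yesterday = {"https://example.com/a", "https://example.com/c"}
    today = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ]
    for article in order(mark_new(today, yesterday), new_first=True):
        print("NEW " if article["new"] else "    ", article["url"])
```

Persisting yesterday's set between daily runs (a file, a database row, browser storage) is left out here, since the comment does not say how that is done.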


arussellsaw 2257 days ago

RSS feeds are a bit of a mess, but only because each publisher's implementation is slightly different. The vast majority only send a headline and summary in the RSS feed, so I then have to go and scrape and extract the article on the backend to populate the content, which is its own challenge, and I've not gotten it to work for a few sites yet.

This is also running in a semi-serverless container in Google Cloud Run (only costs me £1 a month!), so fetching and re-caching all of that when a new container is scheduled is painful. However, it seems like state in the container is persisted longer than I initially thought, so it's good enough for now.
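
A minimal sketch of the two-step flow described in this comment (read the headline and summary from the feed, then extract the body from the article page), using only the Python standard library. It is not the project's actual code: `parse_feed`, `ParagraphExtractor`, and `extract_article_text` are names made up for this example, the feed is assumed to be plain RSS 2.0, and keeping only `<p>` text is a deliberately naive stand-in for real article extraction. Fetching the page at each item's link is omitted.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parse_feed(xml_text):
    """Return title, link and summary for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        }
        for item in root.iter("item")
    ]


class ParagraphExtractor(HTMLParser):
    """Naive extractor: keeps only the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._in_p = False

    def _flush(self):
        # Collapse runs of whitespace and store non-empty paragraphs.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # copes with an unclosed previous <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_article_text(html):
    """Return the page's paragraph text, separated by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Per-site differences are exactly where a heuristic like this breaks down: navigation, captions, or comment sections that happen to use `<p>` tags would be swept into the result.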


k1m 2257 days ago

You might have come across this already, but we maintain a collection of article extraction rules for various sites here: https://github.com/fivefilters/ftr-site-config. It was adapted from a database maintained by Instapaper in its early days, and today its contributions come mainly from users and developers of an open source Instapaper/Pocket alternative called Wallabag: https://github.com/wallabag/wallabag

Also usable with a free version of Full-Text RSS available here: https://bitbucket.org/fivefilters/full-text-rss/src/master/


jkeuhlen 2257 days ago

This looks really interesting! I've been building an RSS reader for myself in my free time, and this will be really useful. I was wondering if you know anything about the legal implications of scraping full content like this and packaging it up? I was planning to do it with some fun added-on features, but was worried it would be considered copyright infringement (since I would basically be re-hosting other sites' content without permission). And some websites outright ban this kind of usage in the TOS for their RSS feeds. For example, from the Washington Post[1]:

> a. For any article, you may not display more text than we provide in the RSS feed.

[1]: https://www.washingtonpost.com/rss-terms-of-service/2012/01/...


chris_st 2257 days ago

AWESOME! Thanks so much, I'll look into that. I think it was the Instapaper service I was using back in the day.


chris_st 2257 days ago

Cool, thanks! I'm doing a lot with AWS serverless; it's a great way to go (and similarly cheap). Any chance you could open-source your scraping code?
