by dchuk 10 days ago
I built a site that's similar in concept to Hacker News, but it's fed entirely by RSS feed content, which is then summarized as bullet points on the article page: https://engineered.at/

I also extract topics from the content automatically with LLMs, which allows for dynamic topic pages that users can subscribe to separately to tune their feeds.

Haven't promoted it much, but it's pretty amazing what you can do for a couple bucks a month. My main thesis with this site is that by locking the content to only the RSS feeds of known blogs, you dramatically reduce the spam submission risk (basically eliminate it). It doesn't handle the spam comment side of things, but that's a different problem.

EDIT: I also open sourced a Rails engine I made to power this site if anyone is interested: https://github.com/dchuk/source_monitor

4 comments

This looks great, I've wanted something like this for a while. Figuring out how to click through to the actual item in the feed was a major point of friction for me.

I went to a topic and then clicked on the header of something I was interested in, expecting to be brought to the blog post directly. Needing to click on that same title again to be brought to the post was unintuitive to me. I searched around the page, went back and forth a few times, and eventually figured it out.

As a user I would love to be able to click directly through to the article FROM the topic feed. I would expect the comments link to point to the page that the header currently brings me to. This would match my expectations from using sites like reddit/HN.

A one or two liner summary directly on the topics feed would be really great I think.

Great feedback, should be straightforward to make happen. I’ll try to implement tonight.
As a sysadmin hosting a few blogs, do you mind sharing what IP ranges you crawl from? Or what agent your requests use? Thank you.
I presume you’re politely asking in order to block? Which is fine, I get it. On my phone right now but can update later.

I do want to ask though (and I should make this clear in a FAQ or something): the way I check RSS feeds uses adaptive scheduling, so I intentionally don’t check feeds of sites too rapidly. Then the summarization is based on the full article content but I never render that full content on the site (to avoid traffic hijacking concerns). Given that: what’s the concern?
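For anyone wondering what "adaptive scheduling" can mean in practice, here is a minimal sketch. It is not the logic from source_monitor; the function name, the floor and ceiling, and the halve/1.5x factors are all assumptions picked for illustration. The idea is only that a feed which keeps producing new items gets polled more often, and a quiet feed gets polled less often.

```python
from datetime import timedelta

# Assumed bounds, not taken from the thread or from source_monitor.
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(hours=24)


def next_poll_interval(current: timedelta, had_new_items: bool) -> timedelta:
    """Return how long to wait before polling a feed again.

    Hypothetical policy: halve the wait when the last poll found new
    items, grow it by 1.5x when it found nothing, and clamp the result
    between MIN_INTERVAL and MAX_INTERVAL.
    """
    if had_new_items:
        proposed = current / 2
    else:
        proposed = current * 1.5
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

With these assumed numbers, a blog that posts once a month drifts toward one check per day, while the floor keeps even a very busy feed from being hit more than once every 15 minutes.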

I do appreciate you addressing the concerns about traffic hijacking, but at the same time I really don't like having my content run through a text mangler like an LLM. I get the use case, but at the end of the day it's my content and I'm a bit prickly.

That said, I'm not necessarily planning to immediately block your crawlers; I intend to just add them to a list I maintain for personal reference. I'm mostly interested in correlating the crawling traffic that I see with various sources. I have been gathering data about crawling activity and sources that I display on an embedded map on my site. I have Caddy annotate traffic with a header indicating what the crawler is, and if the fleet behaves nicely then they don't get added to the blocklist.

Interesting. In terms of "crawling", the way the engine I built works is that by default it's just polling the RSS feed of a site on an adjusting cadence, like any other RSS feed reader. On some sites, the engine can do a follow-up scrape of the article link from the RSS feed if the full content of the article isn't provided in the feed. So it's not real crawling, more fetching/scraping if necessary.

But I hear you.
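To make the "poll the feed, only fetch the article if the feed doesn't carry the full text" idea concrete, here is a rough sketch using only the Python standard library. It is an illustration under assumptions, not how the engine actually decides: the function name and the 500-character threshold are made up, and it only looks at RSS 2.0 `item` elements with `content:encoded` or `description`.

```python
import xml.etree.ElementTree as ET

# Namespaced tag for <content:encoded>, commonly used for full article bodies.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"

# Assumed cutoff: anything shorter is treated as a teaser, not a full article.
MIN_FULL_TEXT_CHARS = 500


def links_needing_fetch(rss_xml: str) -> list:
    """Return item links whose feed entry looks too short to be the full text."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        # Prefer <content:encoded>, fall back to <description>, else empty.
        body = item.findtext(CONTENT_ENCODED) or item.findtext("description") or ""
        link = item.findtext("link")
        if link and len(body.strip()) < MIN_FULL_TEXT_CHARS:
            links.append(link.strip())
    return links
```

Only the links returned here would get a follow-up request, so a site that publishes full content in its feed never sees anything beyond the feed poll itself.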

  Your browser is not supported.
  Please upgrade your browser to continue.
Can't even view your site with Firefox
That’s…bizarre. Let me take a look

EDIT: Just checked in Firefox and I don't see an issue. Can you email me at me@dchuk.com and maybe I can debug with you?

I just noticed the same thing.

UA being blocked for example:

  Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:140.0) Gecko/20100101 Firefox/140.0
Did mess with it some more:

Allowed:

    Opera/9.80 (Windows NT 6.1; U; zh-tw) Presto/2.7.62 Version/11.01
    Opera/9.80 (Windows NT 5.1; U; cs) Presto/2.7.62 Version/11.01
406:

    Mozilla/5.0 (Windows NT 5.1) Gecko/20100101 Firefox/14.0 Opera/12.0
    Mozilla/5.0 (Macintosh; Intel Mac OS X 14; rv:140.0) Gecko/20110101 Firefox/140.0
Maybe just remove it?
Figured it out, had a random block of Firefox versions less than 147 in my ApplicationController for some reason. Of course my home internet went down though so I’ll push in a few.
ok this should now be fixed!
Thanks for this info! Very helpful
Getting

    406 browser not supported
for ESR Firefox 140.

If I set my UA to "FUCKIT" I can use the site perfectly fine. Why is there a User Agent filter that disables the whole website? This should maybe be a warning, not a complete block.
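A warning-only check along those lines could look roughly like the sketch below. It is hypothetical and written in Python rather than as Rails controller code; the function name and message are invented, and 147 is just the cutoff mentioned elsewhere in the thread. The point is that an old or unrecognized User-Agent produces at most a notice, never a 406.

```python
import re
from typing import Optional

# Matches the major version in a "Firefox/<major>" token of a User-Agent string.
FIREFOX_VERSION = re.compile(r"Firefox/(\d+)")

# Cutoff mentioned in the thread; treat it as an example value.
MIN_RECOMMENDED_FIREFOX = 147


def browser_notice(user_agent: str) -> Optional[str]:
    """Return a warning message for older Firefox versions, or None.

    The request is never rejected here; callers are expected to render
    the page either way and show the notice only if one is returned.
    """
    match = FIREFOX_VERSION.search(user_agent)
    if match and int(match.group(1)) < MIN_RECOMMENDED_FIREFOX:
        return "Your Firefox version is older than recommended; some features may not work."
    return None
```

Unknown or made-up User-Agent strings fall through to `None`, which matches the behavior described above where an arbitrary UA could use the site without problems.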

You know, I had set up some analytics filtering based on GeoIP because I was getting crazy spam traffic from China and Singapore, but that should only be affecting analytics, not the whole site. Mind if I ask where you're located? (You can email me privately if preferred: me@dchuk.com)
Europe

IP address has no effect on the User Agent block though...

Yeah I know and agree, just wondering if something is haywire in that logic somehow. Otherwise it’s a bizarre issue but I’ll get it fixed
Glad to hear, and neat site. Cool to see new Ruby on Rails sites. Thought I was the only one still loving it. ;)