Hacker News
by vaskal08 1136 days ago
Hey, other dev on this project. This is a good catch, and we're aware of this issue. What it's doing is actually using a photo caption as part of the article, and we're working on removing the use of that in the summarization process.
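For anyone curious what "removing the caption from the summarization input" might look like, here is a rough sketch, not the project's actual code. It assumes the captions sit inside `<figcaption>` elements, which many sites do not guarantee, and the names `CaptionStripper` and `article_text` are made up for illustration.

```python
from html.parser import HTMLParser


class CaptionStripper(HTMLParser):
    """Collects page text while skipping anything inside <figcaption>.

    Assumption: photo captions are marked up with <figcaption>. Sites that
    use plain <div> or <span> captions would need site-specific rules.
    """

    def __init__(self):
        super().__init__()
        self._caption_depth = 0  # > 0 while inside a <figcaption>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "figcaption":
            self._caption_depth += 1

    def handle_endtag(self, tag):
        if tag == "figcaption" and self._caption_depth > 0:
            self._caption_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any caption.
        if self._caption_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def article_text(html: str) -> str:
    """Return the text to hand to the summarizer, minus captions."""
    parser = CaptionStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it `<p>Main story.</p><figure><figcaption>Photo: someone</figcaption></figure>` would keep the paragraph text and drop the caption before anything reaches the summarizer.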
3 comments

There are news APIs.

Start with those, then figure out how to scrape a site as your input and emit the existing API format. You'll come in through a clever side route and essentially end up with a two-phase assembly line.

This will also let users customize their "feed" as a free side effect of the architecture. You'll be able to isolate your scraping -> API transform on a per-site basis, also as a free consequence. Lastly, you can parallelize the work much more cleanly and even have the public add their own "transformer" for their favorite news site.
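To make the two-phase idea concrete, here is a minimal sketch under some loud assumptions: the `Article` schema, the `register` decorator, the `"example-site"` name and its toy `title|body` input format are all hypothetical, and no real news API's format is being reproduced. Phase one is a per-site transformer that turns raw scraped text into a common record; phase two builds a feed from whichever sites the user picked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Article:
    """Common record every source is normalized into (hypothetical schema)."""
    source: str
    title: str
    body: str


# A transformer turns one site's raw scraped payload into Articles.
Transformer = Callable[[str], List[Article]]

# Registry of per-site transformers; third parties could add entries here.
REGISTRY: Dict[str, Transformer] = {}


def register(site: str) -> Callable[[Transformer], Transformer]:
    """Decorator that registers a transformer under a site name."""
    def decorator(fn: Transformer) -> Transformer:
        REGISTRY[site] = fn
        return fn
    return decorator


@register("example-site")
def example_site(raw: str) -> List[Article]:
    # Toy input format (an assumption): one article per line, "title|body".
    articles = []
    for line in raw.splitlines():
        if "|" not in line:
            continue  # skip lines that do not match the toy format
        title, body = line.split("|", 1)
        articles.append(Article("example-site", title.strip(), body.strip()))
    return articles


def build_feed(raw_by_site: Dict[str, str], wanted: List[str]) -> List[Article]:
    """Phase two: assemble a feed from only the sites the user asked for."""
    feed: List[Article] = []
    for site in wanted:
        transform = REGISTRY.get(site)
        if transform is None or site not in raw_by_site:
            continue  # unknown site or nothing scraped for it
        feed.extend(transform(raw_by_site[site]))
    return feed
```

Because each transformer only sees its own site's payload, they can be run in parallel or swapped out independently, and the `wanted` list is where per-user feed customization would plug in.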

Parsing PDFs or web pages semantically is really not an easy job, as I found in my own foray into LLM summarization.
Maybe image search and if the image is not novel, ignore it?
Good point (it seems to me), and if it's AI generated, (try to) ignore it too I guess
Why? If it is an AI-generated image, it was generated from a text prompt by the author of the article. The author has reviewed the image. The image is novel.

As long as this is novel content, it should be parsed, I think.

Maybe it depends. Let's say some thought has gone into writing the prompt, and the image (and image text?) then explains how something works or helps one understand the article better.

Or what if the prompt used to generate the image doesn't include anything interesting that isn't already in the article (e.g. "generate a nature photo related to this article")?