| HN Mirror


	by vaskal08 1136 days ago
	Hey, other dev on this project. This is a good catch, and we're aware of this issue. What it's doing is actually using a photo caption as part of the article, and we're working on removing the use of that in the summarization process.
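
For illustration only, here is a minimal sketch of one way captions could be kept out of the text handed to a summarizer. It assumes the site marks captions with `<figure>`/`<figcaption>` elements, which many sites do not, and the names `CaptionStripper` and `article_text_without_captions` are made up for this example. It is not the project's actual approach.

```python
from html.parser import HTMLParser


class CaptionStripper(HTMLParser):
    """Collects visible text, skipping anything inside <figure>/<figcaption>.

    Assumption: captions live in these tags; real pages vary a lot.
    """

    SKIP_TAGS = {"figure", "figcaption"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def article_text_without_captions(html: str) -> str:
    """Return the page text with figure/caption content removed."""
    parser = CaptionStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string is what would go to the summarizer instead of the raw page text.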

3 comments

kristopolous 1135 days ago

There are news APIs.

Start with those, then figure out how to scrape a site as your input and emit the existing API format. You'll come in through a clever side route, essentially a two-phase assembly line.

This also lets users customize their "feed" as a free side effect of the architecture. Furthermore, you can isolate the scraping -> API transform on a per-site basis, also as a free consequence. Lastly, you can parallelize the work much more cleanly and even let the public add their own "transformer" for their favorite news site.
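
A rough sketch of the two-phase idea, under stated assumptions: the `Article` fields stand in for whatever format an existing news API returns, `example-news.test` is a placeholder site, and `register` / `to_common_format` are hypothetical names. Each site gets its own isolated transformer, and the transforms run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Article:
    # Assumed common shape, standing in for an existing news API's format.
    source: str
    title: str
    body: str


Transformer = Callable[[str], Article]
TRANSFORMERS: Dict[str, Transformer] = {}  # one entry per site


def register(site: str):
    """Decorator that registers a per-site scrape -> API transformer."""
    def decorator(fn: Transformer) -> Transformer:
        TRANSFORMERS[site] = fn
        return fn
    return decorator


@register("example-news.test")  # placeholder site
def transform_example(raw: str) -> Article:
    # Toy rule: first line is the title, the rest is the body.
    title, _, body = raw.partition("\n")
    return Article("example-news.test", title.strip(), body.strip())


def to_common_format(pages: Dict[str, str]) -> List[Article]:
    """Phase two: run each known site's transformer, in parallel."""
    known = [(site, raw) for site, raw in pages.items() if site in TRANSFORMERS]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda item: TRANSFORMERS[item[0]](item[1]), known))
```

Because everything downstream only sees `Article`, a user-contributed transformer is just another `@register(...)` function, and a per-user feed is a filter over the `source` field.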

lxe 1135 days ago

Parsing PDFs or web pages semantically is really not an easy job, as I found in my own foray into LLM summarization.

startupsfail 1135 days ago

Maybe do an image search, and if the image is not novel, ignore it?

cutemonster 1135 days ago

Good point (it seems to me), and if it's AI-generated, (try to) ignore it too, I guess.

startupsfail 1134 days ago

Why? If it is an AI-generated image, it was generated from a text prompt by the author of the article. The author has reviewed the image. The image is novel.

As long as this is novel content, it should be parsed, I think.

cutemonster 1133 days ago

Maybe it depends. Let's say some thought has gone into writing the prompt, and the image (and image text?) then explains how something works or helps one understand the article better.

Or the prompt used to generate the image doesn't include anything interesting that isn't already in the article (e.g. "generate a nature photo related to this article").
