	by LeonidBugaev 795 days ago
	One of the cases where AI is not needed. There is a well-established algorithm for extracting content from pages; one implementation: https://github.com/buriy/python-readability

6 comments

haddr 795 days ago

Some years ago I compared those boilerplate removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?

jot 794 days ago

This is worth having a look at: https://mixmark-io.github.io/turndown/

With some configuration you can get most of the way there.

asadalt 795 days ago

oh AI is optional here. I do use readability to clean the html before converting to .md.

jot 794 days ago

Last time I tried readability it worked well with articles but struggled with other kinds of pages. Took away far more content than I wanted it to.

IanCal 795 days ago

How do you achieve the same things without AI here using that tool?

chrisweekly 795 days ago

"How do you do it without AI" is a question I (sadly) expect to see more often.

IanCal 795 days ago

Feel free to answer then, how do you do the same functions this does with gpt(3/4) without AI?

Edit -

This is an excellent use of it, a free text human input capable of doing things like extracting summaries. It does not seem to be used at all for the basic task of extracting content, but for post filtering.

cactusfrog 794 days ago

I think “copy from a PDF” could be improved with AI. It’s been 30 years and I still get new lines in the middle of sentences when I try to copy from one.

IanCal 794 days ago

That's a great use case. If you've got copy and paste available on the command line, you might be able to do this with

https://github.com/simonw/llm

in between: an alias like pdfwtf translating to "paste | llm command | copy".
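
For comparison, the specific newline problem can also be approached without an LLM. The sketch below is a plain heuristic and not part of the llm tool: it assumes blank lines separate paragraphs and that every other newline is a soft wrap. The function name `unwrap_pdf_text` is hypothetical.

```python
import re


def unwrap_pdf_text(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a paragraph break; any other newline
    is a soft wrap introduced by the PDF layout.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        # Rejoin words split by a hyphen at a line end: "sen-\ntences" -> "sentences".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Treat every remaining newline as a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        cleaned.append(para)
    return "\n\n".join(cleaned)
```

The hyphen rule is lossy: a genuinely hyphenated word that happens to break at a line end (for example "well-" followed by "known") is merged as well, which is the kind of edge case where a language model might do better.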

genewitch 794 days ago

i've long assumed that is a "feature" of PDF akin to DRM. Making it hard to copy text from a PDF makes sense from a publisher's standpoint.

hombre_fatal 794 days ago

Meh, it’s just the “how does it work?” question. How content extractors work is interesting and not obvious nor trivial.

And even when you see how the readability parser works, AI handles most of the edge cases that content extractors fail on, so they are genuinely superseded by LLMs.

fbdab103 794 days ago

I was honestly expecting it to be mostly black magic, but it looks like the meat of the project is a bunch of (surely hard won) regexes. Nifty.
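
To make the "bunch of regexes" idea concrete, here is a minimal sketch of regex-based scoring of class and id names, in the spirit of readability-style extractors. The patterns, the weight of 25, and the names `class_weight` and `score_blocks` are illustrative assumptions, not copied from python-readability or mozilla/readability.

```python
import re
from html.parser import HTMLParser

# Hypothetical, heavily shortened pattern lists (assumptions for illustration).
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|menu|nav|promo|related|sidebar|sponsor", re.I)


def class_weight(attr_value: str) -> int:
    """Score a class/id string: negative hints subtract, positive hints add."""
    weight = 0
    if NEGATIVE.search(attr_value):
        weight -= 25
    if POSITIVE.search(attr_value):
        weight += 25
    return weight


class BlockScorer(HTMLParser):
    """Record (tag, class/id string, weight) for each container element."""

    def __init__(self):
        super().__init__()
        self.scores = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("div", "section", "article"):
            return
        attr_map = dict(attrs)
        # Combine class and id into one string; skip attributes that are absent.
        label = " ".join(filter(None, (attr_map.get("class"), attr_map.get("id"))))
        self.scores.append((tag, label, class_weight(label)))


def score_blocks(html: str):
    scorer = BlockScorer()
    scorer.feed(html)
    return scorer.scores
```

A real extractor would combine a score like this with text features such as paragraph length and link density before choosing the main content block; this sketch only covers the name-matching part.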

nyokodo 794 days ago

> I was … expecting it to be mostly black magic, but … the meat of the project is a bunch of … regexes

Wait, regexes are the epitome of black magic. What do you consider as black magic?

fbdab103 794 days ago

Macros? Any situation where code edits other code?

Sure, I could not write a regex engine, but the language itself can be fine if you keep it to straightforward stuff. Unlike the famous e-mail parsing regex.

foundzen 795 days ago

how is it compared to mozilla/readability?

asadm 794 days ago

it uses readability but does some additional stuff, like relinking images to local paths etc., which I needed
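
As a rough illustration of what relinking images to local paths can look like, here is a minimal sketch. It is not the tool's actual code: the function name `relink_images`, the default `images` directory, and the regex-based approach are all assumptions. It only rewrites double-quoted `src` attributes and does not download anything.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches a double-quoted src attribute inside an <img> tag (assumed markup style).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.I)


def relink_images(html: str, local_dir: str = "images"):
    """Return (rewritten_html, mapping of original URL -> local path)."""
    mapping = {}

    def replace(match):
        url = match.group(2)
        # Use the last path segment as the file name; the query string is dropped.
        name = posixpath.basename(urlparse(url).path) or "image"
        local_path = f"{local_dir}/{name}"
        mapping[url] = local_path
        return f"{match.group(1)}{local_path}{match.group(3)}"

    return IMG_SRC.sub(replace, html), mapping
```

The returned mapping could then drive a separate download step. Two images that share a file name would collide under this scheme, so a real implementation would need to disambiguate them.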

foundzen 794 days ago

I have had challenges with readability. The output is good for blogs, but when we try it on other types of content, it misses important details even when the page is quite text-heavy, just like a blog.

asadalt 794 days ago

yeah that’s correct. i put a checkbox to disable readability filter if needed…
