This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability
Some years ago I compared those boilerplate removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?
Feel free to answer, then: how do you do the same things this does with GPT-3/4, but without AI?
Edit -
This is an excellent use of it: free-text human input that can do things like extract summaries. It does not seem to be used at all for the basic task of extracting content, only for post-filtering.
I think “copy from a PDF” could be improved with AI. It’s been 30 years and I still get new lines in the middle of sentences when I try to copy from one.
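For context, here is roughly what the non-AI fix looks like: a minimal Python sketch that rejoins hard-wrapped lines. The function name `unwrap_pdf_text` and the two rules it applies (a blank line means a paragraph break, a trailing hyphen followed by a lowercase letter means a split word) are my own assumptions, not how any particular PDF viewer works.

```python
import re


def unwrap_pdf_text(text):
    """Rejoin hard-wrapped lines; blank lines are treated as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    fixed = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Assumption: a trailing hyphen followed by a lowercase
                # letter is a word that was split across two lines.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        fixed.append(joined)
    return "\n\n".join(fixed)
```

The hyphen rule is where it falls apart: a compound like "well-" / "known" split at its real hyphen gets glued into "wellknown". Deciding that needs some knowledge of the language, which is probably where a model could help.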
Meh, it’s just the “how does it work?” question. How content extractors work is interesting, and neither obvious nor trivial.
And even once you see how the readability parser works, AI handles most of the edge cases that content extractors fail on, so they are genuinely superseded by LLMs.
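To give a flavour of the idea, here is a heavily simplified sketch using only the Python standard library. It is not the actual algorithm of readability or jusText. It only shows the general heuristic of splitting a page into text blocks and keeping those that are long and contain little link text. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds are all made up for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real extractors use richer rules.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, link_chars)
        self._parts = []      # raw text pieces of the current block
        self._link_chars = 0  # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def end_block(self):
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.end_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.end_block()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def extract_main_text(html, min_chars=60, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.end_block()  # flush any trailing text
    kept = []
    for text, link_chars in collector.blocks:
        density = link_chars / len(text)
        if len(text) >= min_chars and density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page with a link-heavy nav bar, one long paragraph, and a short footer should return only the paragraph. The non-obvious part in real extractors is everything this sketch skips: scoring parent containers, using class and id hints, and handling pages that are not laid out like articles.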
Macros? Any situation where code edits other code?
Sure, I could not write a regex engine, but the language itself can be fine if you keep it to straightforward stuff, unlike the famous e-mail parsing regex.
I have had challenges with readability. The output is good for blogs, but when we try it on other types of content, it misses important details even when the page is quite text-heavy, just like a blog.