Hacker News
by billyhoffman 1183 days ago
I too was using text-only versions of sites like CNN, Reuters, or Christian Science Monitor[1], and they were fine. But what I really wanted was to turn any news website into a text-only website.

So I built NewsWaffle, which for any website:

https://github.com/acidus99/NewsWaffle

* Automatically builds a list of news stories, separate from the navigational hyperlinks.

* Detects RSS/Atom feeds to provide a more accurate list of news stories.

* Uses Readability to show only article content on article pages.

* Uses metadata like OpenGraph or Twitter cards to provide richer formatting, and to determine page type.
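
As a rough illustration of the feed detection and page-type bullets above, here is a minimal Python sketch using only the standard library. It is not NewsWaffle's actual code. The class name `HeadMetadataParser`, the function `guess_page_type`, and the rule "og:type equal to article means article, anything else means links page" are assumptions made for this example.

```python
from html.parser import HTMLParser

# MIME types commonly used for feed autodiscovery <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class HeadMetadataParser(HTMLParser):
    """Hypothetical helper: collects feed links and OpenGraph properties."""

    def __init__(self):
        super().__init__()
        self.feeds = []       # href values of advertised RSS/Atom feeds
        self.opengraph = {}   # e.g. {"og:type": "article"}

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "link":
            rels = a.get("rel", "").lower().split()
            if "alternate" in rels and a.get("type", "").lower() in FEED_TYPES:
                self.feeds.append(a.get("href", ""))
        elif tag == "meta":
            prop = a.get("property", "")
            if prop.startswith("og:"):
                self.opengraph[prop] = a.get("content", "")


def guess_page_type(html):
    """Return (kind, feeds, opengraph); the kind rule is an assumed heuristic."""
    parser = HeadMetadataParser()
    parser.feed(html)
    kind = "article" if parser.opengraph.get("og:type") == "article" else "links"
    return kind, parser.feeds, parser.opengraph
```

A real implementation would likely also consult Twitter card tags and fall back to other signals when a site ships no metadata at all.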

It regularly converts 900 KB home pages or 1.2 MB news articles into 3 KB of links to news stories or 5 KB of text, respectively.

It does this by:

* Using semantic tags like <header>, <footer>, and <nav> to determine which hyperlinks are navigational and which ones are likely links to news articles.

* Using OpenGraph metadata to determine the page type (such as news stories) and to get extra metadata.

* Using an aggressive HTML parser that strips out a ton of tags, CSS, JS, etc.

* Using the Readability library to extract the text of news articles.
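
The semantic-tag step in the list above could look something like the following Python sketch. This is an illustration under assumptions, not the project's implementation: the names `LinkClassifier` and `classify_links` are made up here, and the only rule applied is "a link inside <header>, <footer>, or <nav> is navigational".

```python
from html.parser import HTMLParser

# Containers whose links are treated as navigational (assumed rule).
NAV_CONTAINERS = {"header", "footer", "nav"}


class LinkClassifier(HTMLParser):
    """Hypothetical helper: splits links into navigational and content links."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0        # how many nav containers we are inside
        self.current = None       # (href, text parts) while inside an <a>
        self.nav_links = []
        self.content_links = []

    def handle_starttag(self, tag, attrs):
        if tag in NAV_CONTAINERS:
            self.nav_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current = (href, [])

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag in NAV_CONTAINERS and self.nav_depth > 0:
            self.nav_depth -= 1
        elif tag == "a" and self.current is not None:
            href, parts = self.current
            text = " ".join("".join(parts).split())  # collapse whitespace
            bucket = self.nav_links if self.nav_depth > 0 else self.content_links
            bucket.append((href, text))
            self.current = None


def classify_links(html):
    """Return (nav_links, content_links) as lists of (href, text) pairs."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.nav_links, parser.content_links
```

Many sites do not use semantic tags consistently, so a rule like this is only a first pass and would need to be combined with other signals.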

I built this as a service in Gemini, so if you have a Gemini browser you can try it. Otherwise, here is an HTTP-to-Gemini proxy showing you what a NYT article looks like:

Gemini link: gemini://gemi.dev/cgi-bin/waffle.cgi/

NYT Homepage: https://portal.mozz.us/gemini/gemi.dev/cgi-bin/waffle.cgi/li...

NYT Article: https://portal.mozz.us/gemini/gemi.dev/cgi-bin/waffle.cgi/ar...

[1] https://www.csmonitor.com/text_edition

6 comments

Pretty amazing.

I tested aldaily.com and had trouble navigating to get to the articles. Allsides.com worked. Techmeme.com did not work.

gemini://gemi.dev/cgi-bin/waffle.cgi/links?https%3A%2F%2Fallsides.com%2F

https://portal.mozz.us/gemini/gemi.dev/cgi-bin/waffle.cgi/li...

Thanks for letting me know. aldaily works great in raw mode:

gemini://gemi.dev/cgi-bin/waffle.cgi/raw?https%3A%2F%2Fwww.aldaily.com%2F

Clicking on the "more" links, which take you to the news articles, also works properly.

(You can get to raw mode by clicking "Force article view" and then "raw mode." I should probably expose that in other places.)
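
The URLs in this thread all share one shape: a mode segment (`links`, `raw`, or `view`) followed by the target site's URL, percent-encoded, as the query string. A small Python sketch for building them by hand, inferred only from the URLs posted here (the helper name `waffle_url` is made up, and the service may accept other modes or parameters):

```python
from urllib.parse import quote

# Base path taken from the Gemini link in the post.
BASE = "gemini://gemi.dev/cgi-bin/waffle.cgi"


def waffle_url(mode, target):
    """Build a NewsWaffle URL; safe='' makes quote() also encode ':' and '/'."""
    return f"{BASE}/{mode}?{quote(target, safe='')}"


print(waffle_url("raw", "https://www.aldaily.com/"))
# gemini://gemi.dev/cgi-bin/waffle.cgi/raw?https%3A%2F%2Fwww.aldaily.com%2F
```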

NewsWaffle tries to determine the type of page. Articles get displayed with content run through Readability, and then the HTML is stripped down. If it's a "links" page, like the home or section page on a news site, it uses HTML elements to try to tell links to news stories apart from navigational links to other parts of the site. Part of that is looking for links with longer text, since link text for news stories tends to run to at least a few words. (This helps sort "About Us" from "New Fusion Experiment a Success".) I'll check into why aldaily isn't working properly.
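
To make the link-text heuristic concrete, here is a tiny Python sketch. The function name and the threshold of four words are assumptions for illustration; the real cutoff NewsWaffle uses is not stated in this thread.

```python
def looks_like_story_link(text, min_words=4):
    """Assumed heuristic: headline links have more words than nav links."""
    return len(text.split()) >= min_words


for label in ["About Us", "New Fusion Experiment a Success"]:
    print(label, "->", looks_like_story_link(label))
# About Us -> False
# New Fusion Experiment a Success -> True
```

A word-count cutoff will misfile short headlines and long navigation labels, which is presumably why it is only one part of the detection.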

Sorry, I can't seem to reproduce the Techmeme issue. It works for me:

gemini://gemi.dev/cgi-bin/waffle.cgi/view?https%3A%2F%2Fwww.techmeme.com

Do the techmeme links click through?
This is fantastic; now I can view news in Gemini all day. Thank you. We need more Gemini sites or tools to convert HTML to it.
What are you using for a Gemini client? Lynx handles Gopher URLs, so I presumed it would be OK with Gemini, but no luck.

Any suggestions?

For the terminal, I use amfora: https://github.com/makew0rld/amfora

For a GUI, I use Lagrange: https://github.com/skyjake/lagrange

Lagrange is sort of the Netscape of Gemini. It works on all the major desktop and mobile OSes. Personally, I prefer Elaho (iOS) or Buran (Android) for mobile.

Absolutely great! It makes https://antiwar.com work better than the actual website.
Great!

A request: In the linked NY Times front page, more formatting for the article list, maybe blank lines between articles. Visually, it's a challenge.

I didn't know I needed this so much.
This is excellent! Wow.