
by ttgurney 1472 days ago

Funny seeing this here as I've been thinking a lot about text-based browsers lately. Just a couple days ago I tried to build this one from source, but I put it aside due to the dependencies on PCRE and a JavaScript engine. (I am running a hand-rolled Linux "distro" so I can't just install ready-made binary packages.)

I do really appreciate that this one uses libcurl on the backend. Surprisingly few browsers do this--Lynx, Links, and w3m all have their own networking code. They have bespoke HTML parsing and rendering as well. I'm lately thinking I want to see a text-mode browser that just glues together libcurl, curses, simple HTML rendering, and maybe an existing HTML parsing library. No text-based HTML rendering library exists that I'm aware of.
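To give a rough idea of how small the "simple HTML rendering" piece of such a glue browser could start out, here is a minimal sketch using only the Python standard library. It is an illustration, not how any of the browsers mentioned work: `TextRenderer`, `render_text` and the tag sets are hypothetical names chosen for this example, and fetching (libcurl) and display (curses) are left out entirely.

```python
from html.parser import HTMLParser

# Assumption: content of these tags is never shown to the reader.
SKIP_TAGS = {"script", "style", "head"}
# Assumption: these tags start a new output line; a real renderer needs far more.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextRenderer(HTMLParser):
    """Collects visible text and breaks lines at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [""]
        self.skip_depth = 0

    def _line_break(self):
        # Only open a new line if the current one holds visible text.
        if self.lines[-1].strip():
            self.lines.append("")

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._line_break()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._line_break()

    def handle_data(self, data):
        if not self.skip_depth:
            self.lines[-1] += data


def render_text(html):
    """Return the visible text of an HTML string, one block per line."""
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in renderer.lines)
    return "\n".join(line for line in lines if line)
```

Tables, lists with markers, and line wrapping to the terminal width are exactly the parts this sketch skips, and they are where most of the real work would be.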

Also these classic text browsers have their own implementations of FTP, NNTP, and some other legacy cruft. I'm thinking most of this could easily be provided by libcurl (if at all).

4 comments

shiomiru 1471 days ago

> I'm lately thinking I want to see a text-mode browser that just glues together libcurl, curses, simple HTML rendering, and maybe an existing HTML parsing library.

I had a similar idea a while ago, except mine was to glue together components from the nim stdlib.

So I wrote something like that, then I thought "hey, why not implement some CSS too?" and that sent me down the rabbit hole of writing an actual CSS-based layout engine... I eventually also realized that the stdlib html parser is woefully inadequate for my purposes.

In the end, I wrote my own mini browser engine with an HTML5 parser and whatnot. Right now I'm trying to bring it to a presentable state (i.e. integrate libcurl instead of using the curl binary, etc.) so I can publish it.

Anyways, if there's a moral to this story it's that writing a browser engine is surprisingly fun, so go for it :)

ttgurney 1471 days ago

I look forward to seeing your 1st release of this program!

> Anyways, if there's a moral to this story it's that writing a browser engine is surprisingly fun, so go for it :)

Good to know. I'd been fairly intimidated by the idea.

augusto-moura 1472 days ago

It depends on quickjs for the JavaScript implementation, which should be fairly simple to compile on a hand-rolled Linux. I'm not so sure about PCRE though.

ttgurney 1471 days ago

Oh I'm sure the actual work to compile those packages is not much. It's more to do with keeping the number of packages on my system to a minimum.

Actually I would not be surprised if the JavaScript engine can be omitted with just a little bit of patching work... assuming there's not actually a build configuration that leaves it out. I've found that with some software projects and their dependencies, "required" does not always mean required.

smaudet 1471 days ago

Call it Unixy or something - unix philosophy of having each program do something separate.

Makes more sense, that's what this guy does anyways with the js engine?

> Surprisingly few browsers do this--Lynx, Links, and w3m all have their own networking code

I think people are suspicious of curl because it is a common utility, and they think it can't possibly have got it right - plus there's something mildly fun about figuring out how to monitor a socket and send/receive IP packets for the first time.

I have played around with the curl code a bit. In part, I also suspect other programs do it to get "closer", i.e. being able to manage/dispatch events from a thread directly instead of waiting on some signal from a curl thread. There is probably something about security and thread safety too...

shiomiru 1471 days ago

The main reason for the aforementioned browsers not using libcurl is mostly historical, as it simply didn't exist back when they were created. (The newest of them is links, first released in 1999 - and according to the curl website, the first libcurl release with a proper interface was in 2000.)

w3m even uses its own regex engine for search, because there was no free regex engine with Japanese support the author could've used back then.

1vuio0pswjnm7 1471 days ago

https://github.com/google/oss-fuzz-vulns/tree/main/vulns/cur...

https://github.com/curl/curl/commit/68ffe6c17d6e44b459d60805...

https://www.cvedetails.com/product/25084/Haxx-Curl.html?vend...

Instead of only "thinking a lot about text-based browsers", I have been actively using them on a daily basis for the past 26 years.

Links already uses ncurses. I am glad that it does not use libcurl and that it has its own "bespoke" HTML rendering. In over 25 years time, I still have yet to see any other program produce better rendering of HTML tables as text. I have had few if any problems with Links versions over the years. I am quite good at "breaking" software and for me Links has been quite robust. The source code is readable for me and I have been able to change or "fix" things I do not like, then quickly recompile. I can remove features. Recently I fixed a version of the program so that a certain semantic link would not be shown in Wikipedia pages. No "browser extension" required.

Links' rendering has managed to keep up with the evolution of HTML and web design sufficiently for me. Despite the enormous variation in HTML across the www, there are very few cases where the rendering is unsatisfactory.^1 I cannot say the same for other attempts at text-only clients. W3C's libwww-based line-mode browser still compiles and works,^2 although I would not be satisfied with its rendering. Nor would I be satisfied with edbrowse, or something simpler such as mynx.^3

I use Links primarily for reading and printing HTML. I use a variety of TCP clients for making HTTP requests, including djb's tcpclient which I am quite sure beats libcurl any day of the week in terms of quality, e.g., the programming skill level of the author and the care with which it was written. This non-libcurl networking code is relatively small and does not need oss-fuzz. I do not intentionally use libcurl. It is too large and complex for my tastes. For TLS, I mainly use stunnel and haproxy.

1. One rare example I can recall is https://archive.is

2. https://github.com/w3c/libwww

3. https://github.com/SirWumpus/ioccc-mynx

ttgurney 1471 days ago

Hey thanks for your perspective and a couple of mentions of software I'd not heard of (like tcpclient).

I agree that curl is pretty big and bloated. I would not call it a deficiency that Links et al. don't depend on it.

I mostly just was thinking that since I already have curl on my system, it'd be nice to have a browser that reuses that code. Especially since curl has upstream support for the much smaller BearSSL rather than depending on OpenSSL/LibreSSL.

1vuio0pswjnm7 1471 days ago

Apologies if I misunderstood.

I like the idea of BearSSL but it has no support for TLS1.3.

I am not a fan of TLS but alas it is unavoidable on today's www. Keeping up with TLS seems like a PITA for anyone maintaining an OpenSSL alternative or even a TLS-supported application.

This is why I pick stunnel and haproxy. These are applications that seem to place a high priority on staying current. Knock on wood. I am open to suggestions for better choices if they exist.

There are many TCP clients to choose from. Before TLS took over the www, it was more popular to write one's own netcat.

I have focused on writing helper applications to handle the generation of HTTP. Thus I can use any TCP client, including old ones that do not support TLS.

The "web browser" is really the antithesis of the idea underlying UNIX of small programs that do more or less only one thing. Browsers try to do _everything_.

This is not appealing to me. I try to split information retrieval from the www into individual tasks. For example,

   1. Extracting URLs from text/html 
   2. Generating HTTP requests
   3. Sending HTTP requests via TCP 
   4. Forwarding requests over TLS 
   5. (a) Reading/printing HTML or (b) extracting MIME filetypes such as PDF, GZIP or JPG
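As an illustration of how small step 1 can be when it is split out, here is a sketch of a URL extractor. It is not the author's tool: `LinkExtractor` and `extract_urls` are hypothetical names, and the choice to look only at `href` and `src` attributes is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Assumption: only href and src are treated as URLs here.
            if name in ("href", "src") and value:
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, base_url=""):
    """Return the URLs found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

The output is just a list of URLs, so it can be fed to whatever program handles the next step.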

The cURL project's curl binary combines all these steps. It has a ridiculous number of options that just keeps growing.

For me, step 5 really does not need to be combined with steps 1-4 into the same binary. I am able to do more when the steps are separated because it allows me more flexibility. To me, one of the benefits of the "UNIX philosophy" is such flexibility. No individual program needs to have too many options, e.g., like curl. Programs can be used together in creative ways. I see the presence of a large number of options in a program like curl as _limiting_, and creating liabilities. If the author has not considered it as something a user "should" want to do, then the program cannot do it. Adding large numbers of options is also a way of catering to a certain type of user with which I generally do not agree. It is a form of marketing.

For step 4, curl is overkill. It has always surprised me that UNIX has not included a small utility to generate HTTP. Thus, I wrote one.
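For readers wondering what "a small utility to generate HTTP" might boil down to, here is a minimal sketch. It is not the author's program, whose behavior and options are not shown in the thread; `build_request` and its `keep_alive` parameter are hypothetical, and the header set is the bare minimum for an HTTP/1.1 GET.

```python
from urllib.parse import urlsplit


def build_request(url, keep_alive=False):
    """Turn a URL into the bytes of a minimal HTTP/1.1 GET request."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        # Assumption: the URL carries no userinfo, so netloc is host[:port].
        f"Host: {parts.netloc}",
        "Connection: " + ("keep-alive" if keep_alive else "close"),
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

The result is plain bytes, so any TCP client (or a TLS-terminating proxy in front of one) can send it unchanged.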

For step 5(a), Links has served me well. I am open to suggestions for a better choice but there are few people online who are _actual_ daily text-only www users that comment about the experience.^1 An HTML reader/printer, without any networking code, is another small program that should be part of UNIX.

For step 5(b) I have written and continue to write small programs to do this, sort of like file carvers such as foremost but better, IMO. However I will often use tnftp for convenience.

I used tnftp for many years as the default ftp client on NetBSD and prefer it over (bloated) curl or wget. It is small enough that I can edit and re-compile if I want to change something. Because it comes from NetBSD project the source code is very easy on the eyes.

1. IMO, no sane _daily_ text-only www user today would use Lynx. Whenever anyone mentions it as a text-only browser option then I believe that person is not likely to be a _daily_ text-only www user. Lynx is bloated and slow compared to Links and the rendering is inferior, IMHO.

marttt 1470 days ago

> ... I have written and continue to write small programs to do this ...

Would you mind sharing some of that code?

Some of your recent comments on web browsers, text browsers and javascript [1 + its follow-up] are really interesting. Thanks for sharing.

1: https://news.ycombinator.com/item?id=32131901

1vuio0pswjnm7 1469 days ago

Below is one for PDF. Compile the 052.l file with something like

     x=$1;
     flex -8iCrf $x;
     cc -O3 -std=c89 -W -Wall -pedantic -I$HOME -pipe lex.yy.c -static -o yy${x%.l};
     strip -s yy${x%.l};
     test -d yy||mkdir yy;
     export PATH=$PATH:$HOME/yy;
     exec mv yy${x%.l} yy;

"yy045" is a small program to remove chunked transfer encoding.
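For context, removing chunked transfer encoding means stripping the hexadecimal size lines that HTTP/1.1 interleaves with the body. The following is a sketch of that idea only; it is not yy045, whose source is not shown here, and `dechunk` is a hypothetical name. It assumes the whole chunked body (without the response headers) is already in memory and ignores trailers.

```python
def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body into the original payload."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        eol = body.index(b"\r\n", pos)
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break  # the zero-length chunk marks the end of the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)
```

A truncated body makes `index` raise `ValueError`, which is probably what a pipeline filter wants anyway.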

These programs are to be used in pipelines, something like

      echo https://www.bezem.de/pdf/ReservedWordsInC.pdf|yy025|nc -vv h1b 80|yy052 >1.pdf

"h1b" is a HOSTS file entry for a localhost TLS-enabled forward proxy

"yy025" is a small program that generates HTTP.

Interestingly I think curl was modified in recent years to detect binary data on stdin. I just tested the following and it extracted the PDF automatically.

       curl https://www.bezem.de/pdf/ReservedWordsInC.pdf > 1.pdf

However, one thing that curl does _not_ do is HTTP/1.1 pipelining. I use pipelining on a daily basis. That is where these programs become useful for me.
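To make the pipelining point concrete: with HTTP/1.1 pipelining, several requests are written to one connection back to back, before any response has been read. Here is a sketch of the request side only, under the assumption that all URLs share one host; `build_pipeline` is a hypothetical helper, not one of the programs described above, and reading the responses back in order is not shown.

```python
from urllib.parse import urlsplit


def build_pipeline(urls):
    """Concatenate GET requests for one host into a single byte string."""
    hosts = {urlsplit(u).netloc for u in urls}
    if len(hosts) != 1:
        raise ValueError("pipelined requests must target a single host")
    host = hosts.pop()
    requests = []
    for i, url in enumerate(urls):
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        # Keep the connection open for every request except the last one.
        conn = "close" if i == len(urls) - 1 else "keep-alive"
        requests.append(
            f"GET {target} HTTP/1.1\r\nHost: {host}\r\nConnection: {conn}\r\n\r\n"
        )
    return "".join(requests).encode("ascii")
```

The returned bytes can be written to a single TCP connection in one go; the server is expected to answer in the same order.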

       cat > 052.l

   %{
   /* PDF file carver: copies everything from "%PDF-" through "%%EOF" to stdout */
   /* PDFs can contain newlines */
   /* yy045 removes them so dont use yy045 */
   #define echo ECHO
   #define jmp BEGIN
   int fileno(FILE *);
   %}
   
   xa "%PDF-"
   xb "%%EOF"
   
   %s xa
   %option noyywrap nounput noinput
   %%
   
   {xa} echo;jmp xa;
   <xa>{xb} echo;jmp 0;
   <xa>.|\n|\r echo;
   .|\n
   
   %%
   int main(){ yylex();exit(0) ;}

   ^D
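For anyone who does not read lex, the same carving idea can be sketched in a few lines of Python. This is an illustration, not a port: `carve_pdfs` is a hypothetical name, the input is assumed to fit in memory, and the case-insensitive matching that the `-i` flag gives the flex version is not replicated.

```python
import re

# Non-greedy match from the PDF header marker to the first end-of-file marker.
# DOTALL lets "." match newlines, which PDF bodies contain.
PDF_SPAN = re.compile(rb"%PDF-.*?%%EOF", re.DOTALL)


def carve_pdfs(stream: bytes) -> list:
    """Return every %PDF- ... %%EOF span found in a byte stream."""
    return PDF_SPAN.findall(stream)
```

Anything outside the markers (HTTP headers, trailing bytes) is dropped. Note that this sketch stops at the first `%%EOF` it sees, so a PDF containing more than one such marker would be cut short.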

marttt 1471 days ago

Interesting post, many thanks. What's your view on w3m, as compared to the others you mention? (Side note: I'm a daily w3m user.)

1vuio0pswjnm7 1471 days ago

I have used it, although it was many years ago. I am not sure about the availability of prior versions of w3m but this is one thing I like about Links. I often compile-edit-recompile early versions and this helps me experiment and understand the development of the program over time. I like that w3m is also called a pager. Ideally I want an HTML reader/printer, a pager, with no networking code. Unless w3m has changed, Links does a better job with HTML tables.

marttt 1470 days ago

Thanks. What I like about w3m is 1) opening images via an external viewer if I want to, and 2) the UI, where everything is done on a command line at the bottom of the screen, vi-style. No input boxes like in Links. That aside, I remember thinking that Links did render some HTML elements better than w3m, though.

For contemporary w3m trickery, see http://w3m.rocks

EDIT: mynx looks interesting, I wasn't aware of it. Really close to my dream browsing experience: A browser that renders HTML as text, has only a few control keys (w3m has quite many, and it can cause confusion at times). Customizing would only be possible via config.h, including handles for viewing images, PDF files etc. I wonder why mynx lacks a "back" key, though.

1vuio0pswjnm7 1469 days ago

Opening files in an appropriate "external viewer" is how I remember browsers used to work. The assumption was that computer users had different dedicated programs to handle different MIME extensions. Links still purports to allow for using external viewers, though I do not use it that way. I do most www retrieval _outside the browser_. Today so-called "modern" browsers are 150MB audio and video players, among countless other things. The concept of the external viewer seems to have been lost.

There are things I dislike about Links. Certainly the NCurses menus and dialog boxes are less than ideal. But as an HTML renderer/printer it is the best program I have found. I recall that Elinks experimented with the vi-style command line. Elinks also created Lua bindings to allow for scripting. As an experiment, I started using Tmux to script Links. It surprised me how well this works. But overall, I have no need to script a browser because I prefer to work _outside the browser_.

marttt 1468 days ago

Yes, I also (vaguely) remember an ELinks branch with some kind of command line. I think I even tried to build it, but it felt too experimental for comfortable usage. Still a good effort, though.

I started to look into Links and ELinks again after reading (and upvoting :) many of your previous comments. I also got really curious about netcat. HTTPS won't work directly, but has anybody ever written a rudimentary, less/more-like front-end to actually browse the web while relying on netcat?

The way you separate browsing into different steps is really inspiring to me, thanks for sharing. Like, you're actually using the web in such a modular way. I'm afraid I won't be capable enough to replicate any of this for my needs (I'm more of a hobbyist with a soft spot for lean, terminal- and text-based workflows, and abusing an old Dell Mini 9 in framebuffer mode as my main machine). But it does get me thinking, heavily, again. Watching a screencast of you "browsing" the web with your helper tools would be interesting.

I suppose with all these hand-tailored helpers, using the internet is a much more "focused" experience: looking for specific things vs the aimless browsing that contemporary tabbed browsers encourage. Easier to leave the internet alone when you rely on those narrowly focused tools, I guess.

As for lean browsers, Dillo with FLTK was also an extremely enjoyable experience under X. Really easy to switch off CSS, a nice config file for hand-tailoring search agents, etc. Using Dillo was when I first realized that I don't need to know how the website was intended to look by the author. I'm fine with just rendering the body text with a tolerable, consistent font face.

It almost feels like in 2022, the major reason regular people need to update their systems is that the web browser "doesn't work". But, end of rant.