Hacker News new | ask | show | jobs
by recoil 2959 days ago
> you shouldn't really need a dedicate script to do that

It isn't because the terminal is mis-configured and/or mis-interpreting those characters, it's because these characters have historically been used in a special way by troff's ascii output driver, as used by the man command.

Bold characters are "emulated" by outputting the character itself, followed by ^H, then by repeating the character, and underline is emulated by outputting _ (the underscore character) followed by ^H, followed by the character to be underlined. This is the same way that bold/underline was achieved on a manual typewriter or by old teletype terminals with paper output.

Pagers like "more" & "less" have in-built behaviour that knows how to interpret these sequences and render bold or underline appropriately, and if you "cat" the file your terminal would probably ignore them, but if you open a file with those sequences in a text editor, you're going to see a bunch of unnecessary ^H characters. The OP's script uses the "col" command to remove the unnecessary ^Bs (and the preceding character) that troff has output.

By default, GNU groff doesn't actually output those sequences anymore and uses ANSI escape codes instead. AFAIK many (most?) distributions actually compile out the ANSI behaviour in favour of the old way though (because "more" and "less" don't actually behave correctly with the ANSI characters by default), but some don't, which breaks the script (it's broken in Cygwin, for instance).

FWIW, if you need a fool-proof way to convert a man page to plain old ASCII with no escapes at all, it's easiest just to redirect the output of "man" to a file:

    man ls > ls.txt
The long-winded way (with groff) is something like:

    zcat /usr/share/man/man1/ls.1.gz | groff -man -Tascii -P -cbdu > ls.txt
3 comments

>Pagers like "more" & "less" have in-built behaviour that knows how to interpret these sequences and render bold or underline appropriately,

Right

>and if you "cat" the file your terminal would probably ignore them

I think the terminal does not ignore them. My guess is that it interprets the characters, but it has no noticeable visual effect on the screen. E.g. if the letter "c" in "cat" is output as "c^hc" (to make in bold in print), the terminal will just print "c", a backspace, then "c" again, which to the user will look the same as a single "c".

You described the issue and cause better than I did :)

I didn't know about the ANSI behavior, interesting.

>FWIW, if you need a fool-proof way to convert a man page to plain old ASCII with no escapes at all, it's easiest just to redirect the output of "man" to a file:

> man ls > ls.txt

IIRC, even with that way, the control chars still appeared in the file (at least on some Unix version), which is why I wrote the script in the first place.

> the control chars still appeared in the file (at least on some Unix version)

That's a good point. My whole reply is a bit GNU-centric: the "col -bx" solution should work with everything except groff in ANSI mode (looks like the flag to go back to the standard troff behaviour is GROFF_NO_SGR [1][2], in case anybody is interested).

[1] See: https://linux.die.net/man/1/grotty [2] SGR is "Select Graphics Rendition", apparently: https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_(Select_G...

Ah fair enough. The few times I have seen this issue was when jumping on old Solaris SPARC boxes which also mangled the the backspace key in the interactive terminal. So it sounds like I've put 2 and 2 together and gotten 5.

Though coincidentally I do the redirect trick a lot as was as querying the gzipped raw files too (one of the projects I'm working on requires building a cutdown man page parser).