by tsimionescu 340 days ago
> Terminals that can't handle the full UTF-8 range are a problem with those terminals IMO. And terminals implemented in Rust probably don't have that problem :).

No, it isn't, and yes, they would. The problem is that the terminal accepts certain valid UTF-8 characters (typically from the ASCII subset) as output control characters. This is how you get things like programs that can output colored text.

This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output. The problem here is with the design of POSIX in general and Linux in particular - the fact that, despite knowing most interaction will be done through a terminal device with no separate control and data channels, they chose to allow control characters as part of file names.

As a result of this, it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking putting the terminal in a display state that the user doesn't expect. The best you can do is recognize terminal control sequences in file names, recognize if your output device is a terminal, and in those cases print out escaped versions of those character sequences.
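
A rough sketch of that workaround in Python, in case it helps. The names `escape_for_terminal` and `print_name` are made up for illustration, and this uses a simple "escape anything that isn't printable" rule rather than recognizing specific control sequences, so it is more conservative than strictly necessary:

```python
import sys


def escape_for_terminal(name: bytes) -> str:
    """Return a form of a raw file name that is safe to show on a terminal.

    Printable characters pass through unchanged. Control characters,
    other non-printable characters and bytes that are not valid UTF-8
    are shown as backslash-x hex escapes of their bytes.
    """
    # surrogateescape turns each undecodable byte into a lone surrogate,
    # which isprintable() reports as non-printable.
    text = name.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        if ch == "\\":
            # Double the backslash so the escaped form stays unambiguous.
            out.append("\\\\")
        elif ch.isprintable():
            out.append(ch)
        else:
            raw = ch.encode("utf-8", errors="surrogateescape")
            out.append("".join(f"\\x{b:02x}" for b in raw))
    return "".join(out)


def print_name(name: bytes, stream=sys.stdout) -> None:
    """Escape only when the output device is a terminal."""
    if stream.isatty():
        stream.write(escape_for_terminal(name) + "\n")
    else:
        # Pipes and files get the exact bytes of the name.
        stream.flush()
        stream.buffer.write(name + b"\n")
```

Note that the escaped form is no longer the real file name, so the user still can't copy it from the terminal and use it as is.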

> the terminal accepts certain valid UTF-8 characters (typically from the ASCII subset) as output control characters. This is how you get things like programs that can output colored text.

The terminal should not allow such a sequence to break it. Yes, being able to output colour is desirable, but it shouldn't come at the cost of breaking, and doesn't need to. (Whereas it is much less unreasonable for a terminal to break when it's sent invalid UTF-8).

> This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output.

"Design" and "supposed to" are overstating things. It's a behaviour that some terminals and some programs have accreted.

> it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking putting the terminal in a display state that the user doesn't expect

I would not say by design, and I maintain that the terminal should handle it.

I believe you're misunderstanding the problem. The terminal doesn't "break" in the sense that it crashes or does something undefined for those cases. The terminal is doing something that is completely meaningful and well defined and probably has some realistic use cases, such as switching to a different character encoding.

The only problem is that it's not what the user wanted to happen. For a simple example, if a file name contains the control sequence for starting a block of red text, and you print that file name as is in a terminal, you'll (1) see a truncated file name (that is, copying the text from the terminal will not give you the actual file name, since the control characters will be entirely missing), and (2) find that all subsequent text is red.

The terminal has done nothing wrong in this case: it used its normal logic for turning text red. The file name is not in any way wrong - it's a completely valid Linux and ext4 file name. The program is not necessarily doing anything wrong - perhaps it was never designed to print to a terminal. But the overall interaction produces the wrong results.
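
To make the red-text example concrete, here is a small Python sketch. The file name is invented, and `visible_text` is a hypothetical helper that only approximates what a terminal shows by removing colour/attribute (SGR) sequences; a real terminal understands many more sequences than this:

```python
import re

# Matches only SGR sequences: ESC [ <digits and semicolons> m
SGR = re.compile(r"\x1b\[[0-9;]*m")


def visible_text(s: str) -> str:
    """Approximate what ends up on screen by dropping SGR sequences."""
    return SGR.sub("", s)


# A legal file name on a typical Linux file system: it contains the
# 5-character "start red text" sequence ESC [ 3 1 m in the middle.
name = "report\x1b[31m.txt"
shown = visible_text(name)

# What the user can see and copy is not the real name.
assert shown == "report.txt"
assert shown != name
assert len(name) - len(shown) == 5
```

Pasting `report.txt` back into a command would name a different file (or none at all), and nothing in the name ever resets the colour.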

> I believe you're misunderstanding the problem. The terminal doesn't "break" in the sense that it crashes or does something undefined for those cases. The terminal is doing something that is completely meaningful and well defined and probably has some realistic use cases, such as switching to a different character encoding.

I'm aware of the details, but I think sometimes that knowledge leads people to miss the forest for the trees. If the user perceives the terminal as having "broken", that's a case of poor UX design at a minimum. Given that users can readily distinguish between legitimate coloured output etc. and terminals getting into a poor state, it really shouldn't be too hard for the terminal itself to do so. (E.g. it's pretty normal for today's terminals to display some kind of visible warning (complete with resume button) when you press Ctrl-S, rather than simply silently stopping). And while this is a much fuzzier and more contentious claim, I think the Rust community's mentality (as seen in e.g. their approach to compiler errors) nudges people towards such approaches.

This is an entirely new claim - that the Terminal should try to understand what its input means. You can go ahead and try to implement this - I think you'll easily see that it's extremely hard to do so.

A terminal is basically a function foo(char_stream) = formatted_char_stream. It has no idea whatsoever what the input means, or what the output is supposed to mean. Your Ctrl-S example is completely in line with this: it's one control code that the Terminal chooses to display/interpret in a certain way (older terminals would just stop, newer terminals display some warning text and wait for user input). Recognizing that the start-red-output control sequence should not be interpreted as a control sequence if it's coming from the output of `ls` is a fundamentally different type of change.
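
Here is roughly what I mean by that function, as a toy Python sketch. `render` is a made-up name, and it only knows two SGR parameters (31 for red, 0 or empty for reset), so treat it as an illustration rather than a terminal emulator:

```python
import re
from typing import Iterator, Tuple

# Either one SGR sequence (ESC [ params m) or any single character.
TOKEN = re.compile(r"\x1b\[([0-9;]*)m|(.)", re.DOTALL)


def render(stream: str) -> Iterator[Tuple[str, str]]:
    """Turn a character stream into (character, colour) cells.

    The only state is the current colour. Nothing in the input says
    which program produced a sequence or whether it was intentional.
    """
    colour = "default"
    for m in TOKEN.finditer(stream):
        params, ch = m.group(1), m.group(2)
        if ch is not None:
            yield ch, colour
            continue
        for p in params.split(";"):
            if p in ("", "0"):
                colour = "default"
            elif p == "31":
                colour = "red"
            # Every other parameter is ignored in this toy version.


# Listing output where a file name contains "start red", then a later line.
cells = list(render("f\x1b[31m\nnext"))
assert cells[0] == ("f", "default")
assert cells[-1] == ("t", "red")
```

From inside `render` there is no way to tell whether the red sequence came from a program colouring its output or from a file name that `ls` printed as is; both arrive as the same characters.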

Would it be nice to have a different concept, a CmdDisplayer that takes as input (commands, command text, control text) and outputs formatted text while understanding its input? Maybe. But it wouldn't be a terminal, and it would require a fundamental redesign of every single program that wants to meaningfully interact with it - especially all shells and any TUI program that might make the most use of such a tool.

> This is an entirely new claim - that the Terminal should try to understand what its input means.

I don't see it as a different claim. My position is that the terminal should not move into a bad state for any valid input, and that a state that the user understands as being "broken" is probably a bad state. To my mind this should not require detailed understanding of what each program is doing, which I agree is difficult, but just some more basic things like making it clearer to the user why their terminal is in a funny state (and how to undo it), or perhaps ensuring that the terminal reverts to a good state whenever a program that had changed its state exits.

Having a way to return to the default would be nice, but it wouldn't fix the base problem that a program can't simply print a file name to stdout if it thinks stdout might be connected to a terminal. Even if the terminal didn't "break", it would still not display the file name in a useful way (e.g. a user couldn't copy paste it from there).

I don't think it's Linux so much as it is any given filesystem implementation. As I understand it validation is entirely up to the filesystem itself. I could be mistaken but I don't believe there's anything stopping you from implementing a filesystem that uses raw binary data for filenames.

There's also the question of what happens if the data structures on disk become corrupted. The filesystem driver might or might not validate the "string" it reads back before returning it to you.

Linux itself exposes various syscalls that operate on file names; userland programs can't interact directly with the FS driver. But Linux chose to implement only 2 restrictions at the syscall level (the slash is used to separate elements of the path, and the NUL byte is used to mark the end of the input). The kernel will resolve the path to a particular file system and send the relevant part of the path to the corresponding FS driver exactly as it received it, and the FS can choose whether to accept or reject it. Most Linux FSs don't apply any extra restrictions either. The main exceptions are FSs written to interface with other systems, such as CIFS or SMB, which additionally apply DOS/Windows filename restrictions by necessity.
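
As a rough Python illustration of how little those 2 restrictions exclude: `kernel_accepts_component` is a hypothetical helper, not a kernel API, and it leaves out length limits and the special meaning of `.` and `..`.

```python
def kernel_accepts_component(name: bytes) -> bool:
    """Check one path component against the 2 syscall-level rules only.

    A component cannot contain a slash (it separates components) or a
    NUL byte (it terminates the path string). Every other byte passes.
    """
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


# Terminal control sequences are fine as far as these rules go.
assert kernel_accepts_component(b"\x1b[31mred.txt")
# So are newlines and bytes that are not valid UTF-8.
assert kernel_accepts_component(b"two\nlines")
assert kernel_accepts_component(b"\xff\xfe")
# Only these are rejected.
assert not kernel_accepts_component(b"a/b")
assert not kernel_accepts_component(b"a\x00b")
```

Whether a given file system then accepts such a name is up to its driver, as described above.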

If Linux had chosen to standardize file names at the syscall level to a safe subset of UTF-8 (or even ASCII), FS writers would never have even seen file names that contain invalid sequences, and we would have been spared a whole set of terminal issues. Of course, UTF-8 was nowhere close to as universally adopted as it is today at the time Linux was developed, so it's just as likely they might have standardized to some subset of UTF-16 or US-ASCII and we might have had other kinds of problems, so it's arguable they made the right decision for that time.