Hacker News new | ask | show | jobs
by serbuvlad 340 days ago
What's a memory issue?

If I access beyond the end of an array in Rust, the panic handler runs and starts unwinding my stack. If I access beyond the end of an array in C++ with .at() the excwption handler runs and starts unwining my stack. If I access beyond the end of an array in C the SIGSEGV handler may (*) run and I could, if I wanted to, start unwinding my stack.

Ah, but in C, sometimes if I access the wrong memory, I get garbadge instead of a panic.

Sure, and if I store my data in a Rust array and store indexes into that array around the place as sort of weak references (something I've seen Rust programmers use and talk about all the time), I can easily fetch the wrong data too.

Rust provides a robust type system and a borrow checker which avoids a lot of common problems at the expence of adhering to a particular programming style. That's fine. That's worth advocating for.

But it's no pannacea. Not even close.

My favorite memory about this is a programmer lambasting Go's strings (which are basically immutable byte vectors) for not enforcing UTF-8, like Rust strings.

He then said that this means that in Go you can print filenames to the screen that can break your terminal session because of this if they contain invalid UTF-8, which Rust forces you to escape explicitly. The irony, of couse, is that the characters that can break your terminal session are perfectly valid UTF-8.

Rust's type safety convinced this guy that his Rust program was immune to a problem that it was simply not immune to.

2 comments

> Sure, and if I store my data in a Rust array and store indexes into that array around the place as sort of weak references (something I've seen Rust programmers use and talk about all the time), I can easily fetch the wrong data too.

Maybe, but you'll do so in an understandable way that you can catch in testing. You won't suddenly start fetching different wrong data when someone builds your program with a newer version of the compiler, which is a very real risk with C. To say nothing of the risk of arbitrary code execution if your program is operating on attacker-supplied data.

> The irony, of couse, is that the characters that can break your terminal session are perfectly valid UTF-8.

Terminals that can't handle the full UTF-8 range are a problem with those terminals IMO. And terminals implemented in Rust probably don't have that problem :).

> Terminals that can't handle the full UTF-8 range are a problem with those terminals IMO. And terminals implemented in Rust probably don't have that problem :).

No, it isn't, and yes, they would. The problem is that the terminal accepts certain valid UTF-8 characters (typically from the ASCII subset) as output control characters. This is how you get things like programs that can output colored text.

This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output. The problem here is with the design of POSIX in general and Linux in particular - the fact that, despite knowing most interaction will be done through a terminal device with no separate control and data channels, they chose to allow control characters as part of file names.

As a result of this, it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking to put the terminal in a display state that the user doesn't expect. Best you could do is recognize terminal control sequences in file names, recognize if your output device is a terminal, and in those cases print out escaped versions of those character sequences.

> the terminal accepts certain valid UTF-8 characters (typically from the ASCII subset) as output control characters. This is how you get things like programs that can output colored text.

The terminal should not allow such a sequence to break it. Yes, being able to output colour is desirable, but it shouldn't come at the cost of breaking, and doesn't need to. (Whereas it is much less unreasonable for a terminal to break when it's sent invalid UTF-8).

> This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output.

"Design" and "supposed to" are overstating things. It's a behaviour that some terminals and some programs have accreted.

> it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking to put the terminal in a display state that the user doesn't expect

I would not say by design, and I maintain that the terminal should handle it.

I believe you're misunderstanding the problem. The terminal doesn't "break" in the sense that it crashes or does something undefined for those cases. The terminal is doing something that is completely meaningful and well defined and probably has some realistic use cases, such as switching to a different character encoding.

The only problem is that it's not what the user wanted to happen. For a simple example, if a file name contains the control sequence for starting a block of red text, and you print that file name as is in a terminal, you'll, (1), see a truncated file name (that is, copying the text from terminal will not give you the actual file name, since the control characters will be entirely missing), and (2) all future text will be red.

The terminal has done nothing wrong in this case: it used its normal logic for turning text red. The file name is not in any way wrong - it's a completely valid Linux and ext4 file name. The program is not necessarily doing anything wrong - perhaps it was never designed to print to a terminal. But the overall interaction produces the wrong results.

> I believe you're misunderstanding the problem. The terminal doesn't "break" in the sense that it crashes or does something undefined for those cases. The terminal is doing something that is completely meaningful and well defined and probably has some realistic use cases, such as switching to a different character encoding.

I'm aware of the details, but I think sometimes that knowledge leads people to miss the forest for the trees. If the user perceives the terminal as having "broken", that's a case of poor UX design at a minimum. Given that users can readily distinguish between legitimate coloured output etc. and terminals getting into a poor state, it really shouldn't be too hard for the terminal itself to do so. (E.g. it's pretty normal for today's terminals to display some kind of visible warning (complete with resume button) when you press Ctrl-S, rather than simply silently stopping). And while this is a much fuzzier and more contentious claim, I think the Rust community's mentality (as seen in e.g. their approach to compiler errors) nudges people towards such approaches.

This is an entirely new claim - that the Terminal should try to understand what its input means. You can go ahead and try to implement this - I think you'll easily see that it's extremely hard to do so.

A terminal is basically a function foo(char_stream) = formatted_char_stream. It has no idea whatsoever what the input means, or what the output is supposed to mean. Your Ctrl-S example is completely in line with this: it's one control code that the Terminal chooses to display/interpet in a certain way (older terminals would just stop, newer terminals display some warning text and wait for user input). Recognizing that the start-red-output control sequence should not be interpreted as a control sequence if it's coming from the output of `ls` is a fundamentally different type of change.

Would it be nice to have a different concept, a CmdDisplayer that takes as input (commands, command text, control text) and outputs formatted text while understanding its input? Maybe. But it wouldn't be a terminal, and it would require a fundamental redesign of every single program that wants to meaningfully interact with it - especially all shells and any TUI program that might make the most use of such a tool.

I don't think it's Linux so much as it is any given filesystem implementation. As I understand it validation is entirely up to the filesystem itself. I could be mistaken but I don't believe there's anything stopping you from implementing a filesystem that uses raw binary data for filenames.

There's also the question of what happens if the data structures on disk become corrupted. The filesystem driver might or might not validate the "string" it reads back before returning it to you.

Linux itself exposes various syscalls that operate with filenames, userland programs can't interact directly with the FS driver. But Linux chose to implement only 2 restrictions at the syscall level (slash used to separate elements of the path and NULL used to mark the end of the input). The kernel will resolve the path to a particular file system and send a subset of the path to the corresponding FS driver exactly as it received them, and the FS can choose whether to accept or reject them. Most Linux FSs don't apply any extra restrictions either. The main exceptions are FSs written to interface with other systems, such as CIFS or SMB, which additionally apply DOS/Windows filename restrictions by necessity.

If Linux had chosen to standardize file names at the syscall level to a safe subset of UF-8 (or even ASCII), FS writers would never have even seen file names that contain invalid sequences, and we would have been spared from a whole set terminal issues. Of course, UTF-8 was nowhere close to as universally adopted as it is today at the time Linux was developed, so it's just as likely they might have standardized to some subset of UTF-16 or US-ASCII and we might have had other kinds of problems, so it's arguable they took the right decision for that time.

It's even worse than that. If a newer version of the compiler is able to leverage knowledge of the array bounds it could "optimize" away an entire chunk of your program. It probably won't do that because compiler authors supposedly aren't openly hostile towards compiler users but it isn't so easy to write an algorithm that will flag such "obviously" wrong things.

The control characters are themselves valid (but unprintable) UTF-8. They are also, against all common sense and reason, permitted within filenames by many filesystems. Rust won't save you here. https://www.compart.com/en/unicode/category/Cc

> Maybe, but you'll do so in an understandable way that you can catch in testing. You won't suddenly start fetching different wrong data when someone builds your program with a newer version of the compiler, which is a very real risk with C. To say nothing of the risk of arbitrary code execution if your program is operating on attacker-supplied data.

What guarantees it? Literally nothing. You can catch errors in testing in C as well. Yeah, in C you get "different" data on a different version of the compiler, but you get garbage data in all versions of the compiler and Valgrind flags that in testing.

And, of course, you can get arbitrary code execution if your program is operating on data coming from multiple users of different privilege levels in Rust if you use vectors like that.

Sure, Rust fixes a lot of easy to commit bugs in C and C++, absolutely. But there are no absolutes.

> What guarantees it? Literally nothing.

The behaviour of looking up index x in array y in rust is well-defined, and consistent between compiler versions. Maybe you use the wrong index and get the wrong customer's data or something, but you'll still get the data at that index (or a panic if the index is invalid).

> You can catch errors in testing in C as well. Yeah, in C you get "different" data on a different version of the compiler, but you get garbage data in all versions of the compiler and Valgrind flags that in testing.

Not always. It's very common for code to be broken according to the standard but do the right thing in some compilers, and then in a different compiler it does something completely bizarre. The code might not do the same thing in testing as it does in release. And while Valgrind does a lot of good for the minority of C programmers who use it, it's far from 100% reliable.

> you can get arbitrary code execution if your program is operating on data coming from multiple users of different privilege levels in Rust if you use vectors like that.

You might in some cases, but it's a lot tricker. You can get the program to operate on different parts of its data than it was intended to, but to go from there to running arbitrary code will still be a significant leap and require an exploitation technique specific to that particular program. Whereas the techniques for going from common classes of C vulnerabilities (e.g. buffer overflow or use after free) to arbitrary code execution are practically textbook at this point.

Funny thing is that you can get undefined behavior and segfaults using only "safe rust", and the rust compiler has subtle bugs that allow you to disable important checks (like type checking), which can leave your code completely broken.

But for some crazy propaganda, rust devs believes that any rust code is safe and sound no matter what.

https://github.com/Speykious/cve-rs/issues/49