
by 1letterunixname 814 days ago

There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Reading individual UTF-8 code points is a trivial exercise given a byte-wide getchar(), and portable C code to do so would be able to run on anything made after 1982. IIRC, they don't teach how to write portable C code in Comp Sci programs anymore, and it's a shame.
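For illustration, here is a minimal sketch of that byte-at-a-time decoding. It is written in Rust rather than C because Rust is what the surrounding thread asks for. The names `read_code_point` and `invalid` are hypothetical, and the sketch assumes the caller wraps a real file in a buffered reader so that each byte is not a separate syscall.

```rust
use std::io::{self, Read};

/// Hypothetical helper: builds an `InvalidData` error with a short message.
fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Reads one UTF-8 encoded code point from `src`, one byte at a time.
/// Returns `Ok(None)` when the input ends cleanly between code points.
fn read_code_point<R: Read>(src: &mut R) -> io::Result<Option<char>> {
    let mut bytes = src.bytes();
    let first = match bytes.next() {
        Some(b) => b?,
        None => return Ok(None),
    };
    // The lead byte says how many continuation bytes follow.
    let (extra, mut value) = match first {
        0x00..=0x7F => (0, first as u32),
        0xC0..=0xDF => (1, (first & 0x1F) as u32),
        0xE0..=0xEF => (2, (first & 0x0F) as u32),
        0xF0..=0xF7 => (3, (first & 0x07) as u32),
        _ => return Err(invalid("unexpected lead byte")),
    };
    for _ in 0..extra {
        let b = match bytes.next() {
            Some(b) => b?,
            None => return Err(invalid("input ends mid-sequence")),
        };
        // Continuation bytes always look like 10xxxxxx.
        if b & 0xC0 != 0x80 {
            return Err(invalid("expected continuation byte"));
        }
        value = (value << 6) | (b & 0x3F) as u32;
    }
    // Reject overlong encodings (a value that fits in a shorter sequence).
    let min = [0, 0x80, 0x800, 0x10000][extra];
    if value < min {
        return Err(invalid("overlong encoding"));
    }
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(value)
        .map(Some)
        .ok_or_else(|| invalid("not a Unicode scalar value"))
}
```

Calling `read_code_point` in a loop until it returns `Ok(None)` visits every code point while holding only a few bytes in memory at a time.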

Never read a file completely into memory at once unless there is zero chance of it being huge: doing so is an obvious DoS vector and a waste of resources.


scottlamb 813 days ago

> There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Apologies for the imprecision: by OS API, I meant syscall, at least on POSIX systems. The functions you refer to are C stdio things. Note also they implement on top of read(2) one of the two options I mentioned: "loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries)".
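The chunked option quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: `for_each_char` and its `chunk_size` parameter are hypothetical names, and the sketch leans on `std::str::from_utf8` and `Utf8Error::error_len` to tell an incomplete trailing sequence apart from genuinely invalid input.

```rust
use std::io::{self, Read};
use std::str;

/// Reads `src` in chunks of at most `chunk_size` bytes and calls `on_char`
/// for every code point, carrying partial sequences over to the next chunk.
fn for_each_char<R: Read>(
    mut src: R,
    chunk_size: usize,
    mut on_char: impl FnMut(char),
) -> io::Result<()> {
    assert!(chunk_size >= 4, "a chunk must be able to hold one full UTF-8 sequence");
    let mut buf = vec![0u8; chunk_size];
    let mut pending = 0; // bytes at the start of `buf` left over from the last chunk
    loop {
        let n = match src.read(&mut buf[pending..]) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes mean the stream stopped mid-character.
            return if pending == 0 {
                Ok(())
            } else {
                Err(io::Error::new(io::ErrorKind::InvalidData, "input ends mid-character"))
            };
        }
        let filled = pending + n;
        let valid_len = match str::from_utf8(&buf[..filled]) {
            Ok(_) => filled,
            // `error_len() == None` means the tail is an incomplete sequence
            // that may be completed by the next chunk.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
        };
        // This prefix was validated just above, so decoding it cannot fail.
        let text = str::from_utf8(&buf[..valid_len]).expect("validated prefix");
        text.chars().for_each(&mut on_char);
        // Move the incomplete tail (at most 3 bytes) to the front of the buffer.
        buf.copy_within(valid_len..filled, 0);
        pending = filled - valid_len;
    }
}
```

Memory use stays bounded by `chunk_size` regardless of file size, and the "extra complexity" amounts to the `valid_up_to` and `copy_within` bookkeeping at the end of the loop.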

btw, if we're being precise, getwchar gets a code point, and character might mean grapheme instead. Same is true for the `str::chars` call in the LLM's Rust snippet. The docstring for that method mentions this [1] because it was written in this century after people thought about this stuff a bit.
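A small sketch of the distinction, using only the standard library: `scalar_count`, `PRECOMPOSED`, and `DECOMPOSED` are hypothetical names for illustration. Both strings render as what a reader would call one character, but `str::chars` sees a different number of items in each.

```rust
/// Number of Unicode scalar values in `s`; this is what `str::chars` yields.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// U+00E9 LATIN SMALL LETTER E WITH ACUTE, a single code point.
const PRECOMPOSED: &str = "\u{e9}";

/// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme.
const DECOMPOSED: &str = "e\u{301}";
```

Here `scalar_count(PRECOMPOSED)` is 1 and `scalar_count(DECOMPOSED)` is 2. The standard library does not segment graphemes; a third-party crate such as `unicode-segmentation` is the usual route if graphemes are what is actually wanted.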

> portable C code to do so would be able to run on anything made after 1982.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#[derive(Debug)]`.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://news.ycombinator.com/item?id=39910542

[3] https://news.ycombinator.com/item?id=39910542


1letterunixname 813 days ago

Yes, the userland side, as presented by POSIX with something like ssize_t read(int fd, void *buf, size_t count). Calling that with count = 1 each time would be wasteful, but libcs have certainly been buffering this since at least the 1980s. I remember this was the case with Borland C/C++.

> Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#[derive(Debug)]`.

Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.


scottlamb 813 days ago

> Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.

No idea what this incoherent, ungrammatical paragraph is supposed to be saying. But if you're under the impression Rust doesn't have its own buffered IO facilities or that using Rust-native libraries offers only "religious purity" benefits over extern "C" stuff, you're mistaken.
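As a rough illustration of Rust's own buffering, the sketch below consumes a reader one byte at a time through `std::io::BufReader` and counts how often the underlying `read` is actually called. `CountingReader` and `count_reads` are hypothetical names invented for this example. The exact number of underlying calls depends on standard-library internals, so treat it as approximate.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical wrapper that counts calls to the underlying `read`.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consumes `src` byte by byte through a `BufReader` with the given capacity.
/// Returns (bytes consumed, calls made to the underlying reader).
fn count_reads<R: Read>(src: R, capacity: usize) -> io::Result<(usize, usize)> {
    let counting = CountingReader { inner: src, calls: 0 };
    let mut reader = BufReader::with_capacity(capacity, counting);
    let mut bytes = 0;
    for b in reader.by_ref().bytes() {
        b?; // propagate any IO error
        bytes += 1;
    }
    Ok((bytes, reader.get_ref().calls))
}
```

With 10,000 input bytes and a 1,024-byte buffer, the caller performs 10,000 one-byte reads while the underlying reader is hit only about once per buffer refill.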

This has diverged from what I'm interested in discussing anyway; see my question upthread about whether there are any LLM tools that gather requirements from incomplete specs in the way I expect human engineers to. In this case, I'd expect it to ask questions such as "how large are input files expected to be?" Better, ask what the greater purpose is, as "character by character" is rarely useful.
