| HN Mirror


by zbentley 2810 days ago

I don't think so. If you want to detect and operate on only the data that could represent ASCII characters, you could certainly process it as a byte string, but you'd have to track the presence of non-ASCII-range byte values yourself, and keep state around to represent whether you were in the middle of a multibyte character as you read through the bytes.

If done right, it would be a (probably much slower) re-implementation of what happens when you use the latin1 trick mentioned. You have to get it right, though (sneaky edge cases abound--what if the file starts in the middle of an incomplete multibyte character?).
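For reference, here is a rough sketch of what I understand the latin1 trick to be: decode the bytes as latin-1 (which maps every byte value 0-255 to the code point of the same number, so it never fails), touch only the ASCII-range characters, and encode back. The function name `uppercase_ascii` and the sample input are made up for illustration and are not from the thread.

```python
def uppercase_ascii(data: bytes) -> bytes:
    """Uppercase only ASCII letters, leaving every other byte untouched.

    Hypothetical helper for illustration. latin-1 is used purely as a
    lossless bytes <-> str bridge, not as a claim about the real encoding.
    """
    text = data.decode("latin-1")  # never raises: every byte is a valid latin-1 char
    # Restrict the change to a-z so non-ASCII characters are passed through as-is.
    out = "".join(ch.upper() if "a" <= ch <= "z" else ch for ch in text)
    return out.encode("latin-1")   # every char is still <= U+00FF, so this is safe


# Sample input: "café" as UTF-8 bytes; the two-byte "é" should survive unchanged.
result = uppercase_ascii(b"caf\xc3\xa9")

# The round trip through latin-1 is lossless for all 256 byte values.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")
```

Note the explicit `a`-`z` check: calling `str.upper()` on the whole decoded string could produce characters outside latin-1, and the encode step would then fail.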

TL;DR this could technically work but is a poor idea.

1 comment

buckminster 2810 days ago

This is talking about the case where you don't know the encoding, so you don't know which byte sequences are multibyte characters. Whether you use latin1 or bytes, the edge cases are exactly the same, and they don't get handled.
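A small sketch of that equivalence, assuming the input happens to be Shift-JIS (which the code does not know). In Shift-JIS the katakana "ソ" is the two bytes 0x83 0x5C, and 0x5C on its own is an ASCII backslash. The helper names below are hypothetical.

```python
def replace_via_latin1(data: bytes, old: str, new: str) -> bytes:
    # Hypothetical helper: the latin1 approach, treating bytes as 1:1 characters.
    return data.decode("latin-1").replace(old, new).encode("latin-1")


def replace_via_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    # Hypothetical helper: the same operation done directly on the byte string.
    return data.replace(old, new)


# "ソ" in Shift-JIS; the second byte looks like an ASCII backslash.
sample = b"\x83\x5c"

via_latin1 = replace_via_latin1(sample, "\\", "/")
via_bytes = replace_via_bytes(sample, b"\\", b"/")

# Both approaches rewrite the trail byte of the multibyte character,
# so both corrupt the data in exactly the same way.
same_result = via_latin1 == via_bytes
corrupted = via_latin1 != sample
```

Neither version has any way to know that the 0x5C was the second half of a character, which is the unhandled edge case in both.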
