by chrismorgan 1883 days ago
One major problem with this: you can’t do it in place. Invalid byte sequences could be 1–4 bytes long, but U+FFFD is exactly three bytes long.

There are still faster approaches than the naïve and probably common "copy valid byte sequences one by one into a resizable result buffer". For instance, scan through the input bytes in a single pass, keeping track of the position and length of each valid sequence, then memcpy each valid sequence into a preallocated buffer.

Edit: Although, it looks like Rust's std already does this, except for preallocating an exactly correct size result buffer: https://doc.rust-lang.org/src/alloc/string.rs.html#538
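A rough sketch of the scan-then-copy idea with an exactly sized buffer, using only `std::str::from_utf8` and the `Utf8Error` accessors. The names `split_valid` and `lossy_exact` are made up for illustration, and this is not how std implements it:

```rust
const REPLACEMENT: &[u8] = "\u{FFFD}".as_bytes(); // 3 bytes: EF BF BD

/// Returns (length of the valid prefix, length of the invalid chunk after it).
/// Hypothetical helper, not part of std.
fn split_valid(input: &[u8]) -> (usize, usize) {
    match std::str::from_utf8(input) {
        Ok(_) => (input.len(), 0),
        Err(e) => {
            let valid = e.valid_up_to();
            // error_len() is None when the input ends in a truncated sequence;
            // treat everything that is left as one invalid chunk.
            let bad = e.error_len().unwrap_or(input.len() - valid);
            (valid, bad)
        }
    }
}

/// Lossy conversion that allocates the result buffer exactly once.
fn lossy_exact(input: &[u8]) -> String {
    // Pass 1: record each valid run as (start, len, followed_by_invalid_chunk).
    let mut runs: Vec<(usize, usize, bool)> = Vec::new();
    let mut out_len = 0;
    let mut pos = 0;
    while pos < input.len() {
        let (valid, bad) = split_valid(&input[pos..]);
        runs.push((pos, valid, bad > 0));
        out_len += valid + if bad > 0 { REPLACEMENT.len() } else { 0 };
        pos += valid + bad;
    }

    // Pass 2: copy each run into a buffer of exactly the right size.
    let mut out: Vec<u8> = Vec::with_capacity(out_len);
    for &(start, len, followed_by_invalid) in &runs {
        out.extend_from_slice(&input[start..start + len]);
        if followed_by_invalid {
            out.extend_from_slice(REPLACEMENT);
        }
    }

    // Re-validates the bytes; a real implementation would probably skip this
    // with the unsafe from_utf8_unchecked, since every piece is known valid.
    String::from_utf8(out).expect("valid runs plus U+FFFD are valid UTF-8")
}
```

With this, `lossy_exact(b"foobar\xcc")` asks for 9 bytes up front instead of starting at 7 and growing. The price is the `runs` side table and the repeated validation calls, so whether it beats the grow-on-demand version would need measuring.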

> preallocating an exactly correct size result buffer

Looks like it just uses the size of the original slice. If the average broken chunk is less than three bytes (maybe quite common?) then it'll have to grow the buffer, at least doubling it.

  >> let bytestring = b"foobar\xcc";
  >> bytestring.len()
  7
  >> let cleaned = String::from_utf8_lossy(bytestring).into_owned();
  >> cleaned.len()
  9
  >> cleaned.capacity()
  14
Use an ASCII character like `?` or, even better, ASCII SUB (0x1A, substitute), which is specifically intended for this sort of thing. Failing that, there are a number of other unused ASCII control characters like VT (vertical tab), FS (file separator), GS (group separator), US (unit separator), CAN (cancel), etc.

No. Don't do any of these things. The reason U+FFFD exists, even though ASCII has any number of fun things you can scribble in one byte, is that U+FFFD specifically isn't any of the things your program wouldn't expect to turn up after unrelated processing.

It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.
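For what it's worth, part of that list can be checked against Rust's standard `char` predicates. This is only a sketch: `is_inert` is a made-up name, std has no general Unicode punctuation or word-separator test (so only ASCII punctuation is covered), and canonical equivalence is not checked at all.

```rust
/// True if `c` falls into none of the categories that std can test for.
/// Hypothetical helper for illustration only.
fn is_inert(c: char) -> bool {
    !(c.is_alphabetic()
        || c.is_numeric()
        || c.is_whitespace()
        || c.is_control()
        || c.is_uppercase()
        || c.is_lowercase()
        || c.is_ascii_punctuation()) // ASCII only; std has no Unicode-wide check
}
```

`is_inert('\u{FFFD}')` is true, while `?` (punctuation) and SUB, VT, FS and friends (control characters) all fail the check.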

As a result, if gibberish sneaks into your system somehow and gets turned into U+FFFD, it's much less likely to cause something important to break elsewhere.

And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.

This is about what can be done in SIMD. See GP and GGP. I think using ASCII SUB is a perfectly reasonable thing to do in this context. You can always further post-process the result to turn ASCII SUB into U+FFFD.
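A scalar (non-SIMD) sketch of that two-step idea, just to show the shape of it. The function names are hypothetical, and it assumes one SUB per invalid byte so that the length never changes; that means a two-byte broken sequence ends up as two U+FFFD after post-processing, where `String::from_utf8_lossy` would emit one.

```rust
const SUB: u8 = 0x1A; // ASCII SUB (substitute)

/// Overwrites every byte of every invalid sequence with SUB, in place.
/// The buffer keeps its length and is valid UTF-8 afterwards.
fn sub_invalid_in_place(buf: &mut [u8]) {
    let mut pos = 0;
    while pos < buf.len() {
        let err = match std::str::from_utf8(&buf[pos..]) {
            Ok(_) => break, // the rest is already valid
            Err(e) => e,
        };
        let start = pos + err.valid_up_to();
        // error_len() is None for a truncated sequence at the end of input.
        let bad = err.error_len().unwrap_or(buf.len() - start);
        for b in &mut buf[start..start + bad] {
            *b = SUB;
        }
        pos = start + bad;
    }
}

/// Optional post-processing step: turn every SUB into U+FFFD.
/// Note this also rewrites any SUB that was genuinely present in the input.
fn sub_to_replacement(cleaned: &str) -> String {
    cleaned.replace('\x1a', "\u{FFFD}")
}
```

Running `sub_invalid_in_place` on `b"foobar\xcc"` leaves seven bytes, `foobar` followed by 0x1A, and `sub_to_replacement` then gives the nine-byte `foobar` plus U+FFFD.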