by canimus 2269 days ago
Thank you @eesmith. Comments appreciated, and PRs to the repo as well. ;-) The multi-byte issue is a great catch! I made the wrong assumption of single-byte separators. Perhaps it's a library limitation if we want to keep the logic simple. Ideas on the fix?

If it's a fixed-width encoding, nudge the read size to a multiple of that encoding size.
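
A rough sketch of the fixed-width case, assuming UTF-32 (4 bytes per character) and a well-formed file. The names aligned_read_size and read_fixed_width are made up for illustration, not anything from the repo:

```python
import io

def aligned_read_size(requested, unit):
    """Round `requested` up to the nearest multiple of `unit` bytes."""
    if unit <= 0:
        raise ValueError("unit must be positive")
    return max(unit, -(-requested // unit) * unit)

def read_fixed_width(stream, encoding, unit, read_size=4096):
    """Yield decoded blocks from a binary stream in a fixed-width encoding.

    Assumes stream.read(n) returns exactly n bytes until EOF, which holds
    for regular files and BytesIO but not necessarily for pipes or sockets.
    """
    size = aligned_read_size(read_size, unit)
    while True:
        block = stream.read(size)
        if not block:
            break
        # Every block is a whole number of characters, so this cannot
        # fail on a character split across two reads.
        yield block.decode(encoding)

data = "héllo wörld".encode("utf-32-le")
text = "".join(read_fixed_width(io.BytesIO(data), "utf-32-le", 4, read_size=10))
```

Here a requested read size of 10 gets nudged to 12, so each block holds exactly three characters.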

If it's utf-8, keep the block reads in byte space, search for the terminator as a byte sequence, and only decode after you find the terminator.
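
Something like this, as a minimal sketch. iter_records and its parameters are hypothetical names, and it assumes the stream is opened in binary mode:

```python
import io

def iter_records(stream, separator, encoding="utf-8", read_size=65536):
    """Yield decoded records from a binary stream, split on `separator`."""
    sep = separator.encode(encoding)  # search for the terminator as bytes
    if not sep:
        raise ValueError("separator must not be empty")
    buffer = b""
    while True:
        block = stream.read(read_size)
        if not block:
            break
        buffer += block
        parts = buffer.split(sep)
        # The last piece may be an incomplete record (or end with part of
        # a separator), so keep it in the buffer for the next read.
        buffer = parts.pop()
        for part in parts:
            # Decode only complete records, never a raw block.
            yield part.decode(encoding)
    if buffer:
        yield buffer.decode(encoding)

data = "α→β→γ".encode("utf-8")
records = list(iter_records(io.BytesIO(data), "→", read_size=3))
```

The read size of 3 is deliberately tiny so that both the records and the 3-byte separator get split across blocks. Searching in byte space is safe for UTF-8 because the encoding is self-synchronizing: the bytes of a valid separator cannot match in the middle of some other character.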

Otherwise, throw your hands up in the air and give up?

Catch the UnicodeDecodeError, use err.start, and see if it's close to the end of the block? If it is, then do another read?
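
Roughly like this, assuming UTF-8, where a character is at most 4 bytes. read_decoded_block and MAX_CHAR_BYTES are names I made up for the sketch:

```python
import io

MAX_CHAR_BYTES = 4  # longest UTF-8 sequence; an assumption for other encodings

def read_decoded_block(stream, read_size=65536, encoding="utf-8"):
    """Read one block and decode it, reading extra bytes if a character
    was cut off at the end of the block. Returns "" at EOF."""
    block = stream.read(read_size)
    while block:
        try:
            return block.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte.
            if len(block) - err.start >= MAX_CHAR_BYTES:
                raise  # failure is not at the tail: the data is really bad
            extra = stream.read(1)
            if not extra:
                raise  # character truncated at end of file
            block += extra
    return ""

stream = io.BytesIO("ab→cd".encode("utf-8"))
first = read_decoded_block(stream, read_size=3)
second = read_decoded_block(stream, read_size=3)
third = read_decoded_block(stream, read_size=3)
```

The first 3-byte read ends one byte into the 3-byte arrow, so the function pulls two more bytes before decoding succeeds. The standard library's codecs.getincrementaldecoder does this kind of buffering for you, if you would rather not handle the exception yourself.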

BTW, you can mitigate some Python overhead by using a larger read size.