|
|
|
|
|
by masklinn
3581 days ago
|
|
ignore this, I somehow missed that the poster specifically mentioned Python 3, which does have an encoding-aware CSV module. ~~It's also unclear which version of Python is being used, the Python 2 csv module is byte-based and encoding-unaware which can lead to unexpected behaviours.~~ Go's CSV package apparently only does UTF-8, and suggestions for speeding it up in the tracker is to just remove that and work on raw bytes (FFS) |
|
This is valid, because UTF-8 was designed to make this valid. The UTF-8 encoding of a comma, 0x2C (also the ASCII encoding of a comma), does not appear as a part of any other UTF-8 encodings. Same with the UTF-8 encoding of the double quote, 0x22. So scanning for 0x22 and 0x2C bytes, without stopping to decode other UTF-8 sequences along the way, will produce the correct result for a valid UTF-8 input string. Then you fully decode UTF-8 for the individual fields when needed (and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all).