by louai
1901 days ago
Note that it says less than one character. A character in UTF-8 can be composed of multiple bytes. The encoding scheme is laid out in the linked email. Based on the high bits of each byte it's possible to detect when a new character starts. Relevant portion:

We define 7 byte types:
T0 0xxxxxxx 7 free bits
Tx 10xxxxxx 6 free bits
T1 110xxxxx 5 free bits
T2 1110xxxx 4 free bits
T3 11110xxx 3 free bits
T4 111110xx 2 free bits
T5 111111xx 2 free bits
Encoding is as follows.
From hex  Thru hex  Sequence            Bits
00000000  0000007f  T0                  7
00000080  000007FF  T1 Tx               11
00000800  0000FFFF  T2 Tx Tx            16
00010000  001FFFFF  T3 Tx Tx Tx         21
00200000  03FFFFFF  T4 Tx Tx Tx Tx      26
04000000  FFFFFFFF  T5 Tx Tx Tx Tx Tx   32
[...] 4. All of the sequences synchronize on any byte that is not a Tx byte.
If you are starting mid-run, skip the initial Tx bytes. What you skip will always be less than one character.
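
To make the resync rule concrete, here is a rough Python sketch of "skip initial Tx bytes". The function name `skip_to_char_boundary` and the sample string are my own, not from the email; it only relies on the fact that a Tx byte has the high bits `10`.

```python
def skip_to_char_boundary(buf: bytes, start: int = 0) -> int:
    """Return the index of the first byte at or after `start` that is
    not a Tx (10xxxxxx) byte, i.e. the start of the next character.

    Hypothetical helper for illustration; it does not validate the
    rest of the sequence.
    """
    i = start
    # A Tx byte has its top two bits set to 10.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


# 'a' is 1 byte, 'é' is 2 bytes, '€' is 3 bytes, 'b' is 1 byte.
data = "aé€b".encode("utf-8")

# Offset 4 lands in the middle of '€'; skipping Tx bytes reaches 'b'.
boundary = skip_to_char_boundary(data, 4)
tail = data[boundary:].decode("utf-8")
```

Starting at an offset that is already a character start returns that offset unchanged, so it is safe to call on any position.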