by mjevans 1901 days ago
The upper bits of the FIRST octet determine the length of the sequence. Every other byte in the sequence has its upper two bits set to 10 (that is, (byte & 0xC0) == 0x80, or 0b10xxxxxx), which marks it as another 6 bits of data for the current character.
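
To make the bit test concrete, here is a minimal Python sketch. The names `leading_ones` and `classify` are made up for this example, and it does no validation (a byte of all 1s is still reported as a lead byte), so treat it as an illustration rather than a decoder.

```python
def leading_ones(byte: int) -> int:
    """Count the consecutive 1 bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def classify(byte: int) -> tuple:
    """Return (kind, sequence_length) for one octet.

    sequence_length is 0 for continuation bytes, since they do not
    carry the length themselves.
    """
    ones = leading_ones(byte)
    if ones == 0:
        return ("single", 1)        # 0vvvvvvv: a complete 7-bit character
    if ones == 1:
        return ("continuation", 0)  # 10vvvvvv: 6 more bits of the current character
    return ("lead", ones)           # 110vvvvv, 1110vvvv, ...: number of 1s = octet count
```

For example, `classify(0xC3)` gives `("lead", 2)` and `classify(0xA9)` gives `("continuation", 0)`; together those two bytes are the encoding of "é".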

If synchronization is lost mid-character, by definition that interrupted character is lost. However, the very next complete character is clearly marked by a byte that either has its high bit clear (a 7-bit character) OR begins with two or more 1s, giving the octet count, followed by a zero.
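
A rough Python sketch of that resynchronization, assuming the reader simply skips continuation bytes until it reaches a byte that can start a character. The function name `resync` is hypothetical, and the sketch does not check that the character it lands on is complete.

```python
def resync(buf: bytes, pos: int = 0) -> int:
    """Return the index of the first octet at or after pos that starts a character."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1  # 10vvvvvv: tail of a character whose first octet was missed
    return pos


data = "é€a".encode("utf-8")   # c3 a9 | e2 82 ac | 61
start = resync(data, 1)        # position 1 is in the middle of "é"
print(data[start:].decode("utf-8"))
```

Only the interrupted "é" is dropped; decoding resumes cleanly at "€".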

This is covered in the section titled:

    Proposed FSS-UTF
    ----------------
    ...
      Bits  Hex Min  Hex Max  Byte Sequence in Binary
    1    7  00000000 0000007f 0vvvvvvv
    2   11  00000080 000007FF 110vvvvv 10vvvvvv
    3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
    ... Examples trimmed for mobile.
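
As an illustration of how the three rows shown above map a value to bytes, here is a small Python sketch. The function name `encode` is made up for this example, it covers only the 1, 2 and 3 octet rows quoted here (the trimmed part is deliberately not filled in), and it makes no attempt to reject values a real encoder might refuse.

```python
def encode(code: int) -> bytes:
    """Encode a value using only the three table rows quoted above."""
    if code < 0:
        raise ValueError("negative values cannot be encoded")
    if code <= 0x7F:
        # 0vvvvvvv
        return bytes([code])
    if code <= 0x7FF:
        # 110vvvvv 10vvvvvv
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code <= 0xFFFF:
        # 1110vvvv 10vvvvvv 10vvvvvv
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    raise ValueError("longer sequences are not covered by the rows shown")
```

For instance, `encode(0x20AC)` yields the bytes `e2 82 ac`, which matches `"€".encode("utf-8")`.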