Hacker News
by nickpatch 4844 days ago
It corrupts Unicode data because it splits on code points instead of grapheme clusters. Applied to 'Spin̈al Tap', it separates the base character U+006E (LATIN SMALL LETTER N) from the combining character U+0308 (COMBINING DIAERESIS) and produces the string 'Spin<span class="s_char">̈</span>al Tap'. The diaeresis now combines with the '>' that ends the opening tag, so the output contains the valid Unicode grapheme cluster '>̈'! If you were to split on grapheme clusters instead, the result would be 'Spi<span class="s_char">n̈</span>al Tap'. However, I still wouldn't support that solution because wrapping characters in markup could negatively affect the text segmentation used by search engine indexing and natural language processing tools.
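To make the difference concrete, here is a rough Python sketch of both splitting strategies. The names `simple_clusters` and `wrap_each` are hypothetical helpers of mine, not taken from the code under discussion. The clustering is a deliberate simplification that only attaches combining marks (as reported by `unicodedata.combining`) to the preceding base character. It is not a full implementation of the Unicode grapheme cluster rules.

```python
import unicodedata


def simple_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a simplification. It only checks the canonical
    combining class and ignores the other grapheme cluster rules
    (emoji sequences, Hangul syllables, and so on).
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach the mark to the previous unit.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def wrap_each(units):
    """Wrap every unit in the span markup quoted in the comment above."""
    return "".join(f'<span class="s_char">{u}</span>' for u in units)


text = "Spin\u0308al Tap"  # 'n' followed by U+0308 COMBINING DIAERESIS

by_code_point = list(text)          # splits 'n' from its diaeresis
by_cluster = simple_clusters(text)  # keeps 'n' + U+0308 together

broken_html = wrap_each(by_code_point)
intact_html = wrap_each(by_cluster)
```

In `broken_html` the diaeresis ends up directly after the '>' of its own opening tag, while in `intact_html` it stays inside the same span as the 'n'. Neither version addresses the segmentation concern for indexing and NLP tools.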