by Avernar
3209 days ago
What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string. As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially fixed-width columns and such. You should be thinking relatively. For example: find the first space, then, from that point in the string, check that the next character is a letter. When you build your code that way you don't fall into the trap of byte indexing or take the performance hit of codepoint indexing (UTF-8) or grapheme indexing (all encodings).
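A minimal sketch of that "relative" style, in Python, might look like the following. The function name `letter_follows_first_space` is made up for illustration. Note that a Python `str` iterates over code points, not graphemes, so this only approximates the idea; it walks the string once with an iterator and never computes a numeric offset.

```python
def letter_follows_first_space(text: str) -> bool:
    """Return True if the character right after the first space is a letter.

    Hypothetical helper: scans forward with an iterator instead of
    indexing into the string by position.
    """
    chars = iter(text)
    for ch in chars:
        if ch == " ":
            # Take the character that follows the space; "" if the string ends here.
            following = next(chars, "")
            return following.isalpha()
    # No space found at all.
    return False
```

Something like `letter_follows_first_space("héllo wörld")` works the same whether the letters are ASCII or not, because the code never asks "what is at position N".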
For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.
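To make the positional check concrete, here is a rough Python sketch of just the one rule described above (prefix 'A' followed by six digits). The function name `is_rrb_primary_hicn` is invented for illustration, and real HICN handling covers many more formats than this; treat it as an example of the indexing-plus-length style, not as a validator.

```python
def is_rrb_primary_hicn(hicn: str) -> bool:
    """Check the single rule from the comment: 'A' followed by exactly six digits.

    Hypothetical helper; other HICN types are deliberately not handled.
    """
    return (
        len(hicn) == 7            # one prefix character plus six digits
        and hicn[0] == "A"        # positional check on the prefix
        and hicn[1:].isascii()    # reject non-ASCII digits such as Arabic-Indic
        and hicn[1:].isdigit()    # the remaining six characters must be digits
    )
```

The `isascii()` guard is there because `str.isdigit()` on its own also accepts digits from other scripts.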
These data types can and should be subjected first to some basic checks to ensure they're not nonsense (e.g., something expected to be a numeric value probably should not contain Linear B code points, and it's probably a good idea to at least throw a regex at it first, but then applying regex to Unicode also has quirks people don't often expect at first...).
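One such quirk, sketched in Python as an illustration: for `str` patterns, `\d` matches any Unicode decimal digit unless the `re.ASCII` flag is set. The helper name `looks_numeric` and its `ascii_only` parameter are assumptions made up for this example.

```python
import re

# \d on a str pattern matches any Unicode decimal digit by default.
UNICODE_DIGITS = re.compile(r"\d+")
# With re.ASCII, \d is restricted to 0-9 (equivalent to [0-9]).
ASCII_DIGITS = re.compile(r"\d+", re.ASCII)


def looks_numeric(value: str, ascii_only: bool = True) -> bool:
    """Hypothetical sanity check: is the whole value made of digits?"""
    pattern = ASCII_DIGITS if ascii_only else UNICODE_DIGITS
    return pattern.fullmatch(value) is not None
```

With `ascii_only=False`, a string of Arabic-Indic digits such as `"\u0661\u0662\u0663"` passes the check, which is probably not what a field expecting a plain numeric value wants.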