by nwellnhof
1706 days ago
Working directly on encoded UTF-8 sequences is a nice trick that lets you look up Unicode properties without even decoding a character. I did something similar for Apache Lucy [1]. Note that you can store the data for each "level" in a single table and compute the index with bit operations, as explained in the article.

[1] https://gitbox.apache.org/repos/asf?p=lucy.git;a=blob;f=core...
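
To make the single-table-per-level idea concrete, here is a rough Python sketch. It is not the Lucy code or the article's code; the names `build_tables` and `lookup`, the 64-entry block size, and the convention that property value 0 means "no property" are all assumptions made for illustration. It also assumes the input is valid UTF-8.

```python
def build_tables(prop):
    """Build lookup tables from prop: {code point: non-zero int value}.

    Hypothetical helper for illustration. Returns (root, tables), where
    root is indexed by the lead byte and tables[k] is one flat list
    holding every 64-entry block for continuation byte k + 1.
    """
    # Step 1: nested-dict trie keyed by the raw UTF-8 bytes.
    trie = {}
    for cp, value in prop.items():
        encoded = chr(cp).encode("utf-8")
        node = trie
        for byte in encoded[:-1]:
            node = node.setdefault(byte, {})
        node[encoded[-1]] = value

    # Step 2: flatten. Block 0 of every level is all zeros, so byte
    # sequences that were never inserted fall through to value 0.
    tables = [[0] * 64 for _ in range(3)]
    seen = {(level, (0,) * 64): 0 for level in range(3)}

    def flatten(node, level):
        block = []
        for low in range(64):
            child = node.get(0x80 | low, 0)
            if isinstance(child, dict):
                child = flatten(child, level + 1)  # block id in next level
            block.append(child)
        key = (level, tuple(block))
        if key not in seen:  # share identical blocks within a level
            seen[key] = len(tables[level]) // 64
            tables[level].extend(block)
        return seen[key]

    root = [0] * 256
    for byte, child in trie.items():
        root[byte] = flatten(child, 0) if isinstance(child, dict) else child
    return root, tables


def lookup(root, tables, data, pos=0):
    """Return the property value of the character starting at data[pos].

    Works on the encoded bytes only; the code point is never decoded.
    """
    b0 = data[pos]
    x = root[b0]
    if b0 < 0x80:
        return x  # ASCII: the root entry is already the value
    # Number of continuation bytes, taken from the lead byte.
    n = 1 if b0 < 0xE0 else 2 if b0 < 0xF0 else 3
    for level in range(n):
        # Block id in the high bits, low 6 bits of the continuation byte.
        x = tables[level][(x << 6) | (data[pos + 1 + level] & 0x3F)]
    return x
```

For example, after `root, tables = build_tables({0x20AC: 3})`, calling `lookup(root, tables, "€".encode("utf-8"))` walks the bytes `E2 82 AC` through one root entry and two table reads. The lead byte fixes how many levels are read, so an entry in a given level is either a final value or a block id for the next level, depending on the sequence length.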