by nwellnhof 1706 days ago
Working directly on encoded UTF-8 sequences is a nice trick that lets you look up Unicode properties without even decoding a character. I did something similar for Apache Lucy [1]. Note that you can store the data for each "level" in a single table and compute the index with bit operations as explained in the article.

[1] https://gitbox.apache.org/repos/asf?p=lucy.git;a=blob;f=core...
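
For anyone who wants to see the shape of the idea, here is a minimal sketch in Python. It is not the Lucy code or the article's code; the names `build_tables` and `lookup`, the boolean property, and the exact table layout are my own assumptions for illustration. Each continuation-byte position gets one flat table, and the index into it is `(node << 6) | (byte & 0x3F)`. Input is assumed to be the valid UTF-8 encoding of a single character.

```python
def build_tables(codepoints):
    """Build lookup tables for a boolean property (hypothetical layout).

    lead[b]   : property value for ASCII bytes, node id for multi-byte lead bytes
    levels[d] : one flat table per continuation-byte position, 64 slots per node;
                node 0 is an all-zero "empty" node shared by every missing path
    """
    lead = [0] * 256
    levels = [[0] * 64 for _ in range(3)]  # UTF-8 has at most 3 continuation bytes

    def new_node(depth):
        # Append a fresh 64-slot node to the flat table and return its id.
        node_id = len(levels[depth]) // 64
        levels[depth].extend([0] * 64)
        return node_id

    for cp in codepoints:
        enc = chr(cp).encode("utf-8")
        if len(enc) == 1:
            lead[enc[0]] = 1  # ASCII: the lead table holds the value itself
            continue
        node = lead[enc[0]]
        if node == 0:
            node = lead[enc[0]] = new_node(0)
        # Walk or create the nodes for all but the last continuation byte.
        for depth, byte in enumerate(enc[1:-1]):
            idx = (node << 6) | (byte & 0x3F)
            nxt = levels[depth][idx]
            if nxt == 0:
                nxt = levels[depth][idx] = new_node(depth + 1)
            node = nxt
        # The slot selected by the last byte stores the property value.
        levels[len(enc) - 2][(node << 6) | (enc[-1] & 0x3F)] = 1
    return lead, levels


def lookup(tables, encoded):
    """Return the property for one UTF-8 encoded character without decoding it."""
    lead, levels = tables
    node = lead[encoded[0]]
    for level, byte in zip(levels, encoded[1:]):
        # Bit operations only: node id in the high bits, low 6 payload bits below.
        node = level[(node << 6) | (byte & 0x3F)]
    return node


tables = build_tables([0x20, 0xA0, 0x2003, 0x1F600])
print(lookup(tables, "\u2003".encode("utf-8")))
print(lookup(tables, "\u2004".encode("utf-8")))
```

A real implementation would presumably deduplicate identical nodes and pack the values more tightly; the sketch skips both to keep the indexing visible.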