Hacker News new | ask | show | jobs
by jhallenworld 1707 days ago
I did a four-level radix tree for the Unicode character class and case conversion support in JOE (Joe's Own Editor, so that you can edit Unicode on non-Unicode systems): index sizes of 7 bits, 5, 5 and 4 plus ASCII has a fast-path table.

The main difference is that it computes the tables at startup from the sorted Unicode intervals. So the construction code has to be fast. The same code is also used for user character classes in regular expressions.

Anyway, it builds them in two passes. First pass it de-duplicates nodes, but only the previously constructed node is a candidate for de-duplication. This keeps the memory usage low during construction. De-duplicated nodes can still be modified during this construction, so they may be re-duplicated (there is a reference counter to determine when this happens).

Second pass (after all data is loaded, no more changes allowed), it globally de-duplicates the leaf nodes using a hash table. Many of the leaf nodes are duplicates (and not just the all zero ones).

1 comments

Oh wow, JOE was the first editor I've used on Linux -- ages ago -- and it's still so much better than other very popular choices today. It's friendly, fast, feature-packed and the keyboard shortcuts (of WordStar heritage) are lovely.

JOE is one of my favourite software projects, one I can only look up to. I don't know anything about you, but thank you!