|
|
|
|
|
by def-lkb
1206 days ago
|
|
I don't think you should be worried about Unicode in particular. Although the derivation formula on paper is parameterized by a character, you don't have to compute the derivative for every character separately. It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivative re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives. Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters. (Or a boolean algebra if you care about negation/complementation). The difficulty in handling unicode has more to do with the efficiency of the automaton representation and manipulation rather than in turning the RE in an NFA or DFA. |
|
It has been a long time since I read the "Regular-expression derivative re-examined" derivative paper. Mostly the only thing I remember at this point is that I came away thinking that it would be difficult to adapt in practice for large Unicode classes. But I don't remember the details.
It is honestly very difficult for me to translate your comment here into an actionable implementation strategy. But that's probably just my inexperience with derivatives talking.