| HN Mirror

	by andolanra 4995 days ago
	If you implement a naïve Aho-Corasick text search over one-byte characters, it works without modification on UTF-8 text. It does not ignore combining characters, but UCS-2 also has combining characters (cf. other comments in this thread), so no matter which encoding you use, you must first normalize both the Unicode text and the search string before comparing them for equivalence (or for compatibility, which is a looser notion than equality of Unicode code point sequences).
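	To make that concrete, here is a rough Python sketch of a byte-level Aho-Corasick matcher with a normalizing wrapper. The names build_automaton, search and find_all are made up for illustration, and picking NFC as the default form is just an assumption; NFKC would give the looser compatibility matching instead.

```python
import unicodedata
from collections import deque


def build_automaton(patterns):
    """Build a byte-level Aho-Corasick automaton from a list of bytes patterns."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        state = 0
        for b in pat:  # iterating bytes yields ints 0..255
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(pat)
    # Breadth-first pass to fill in failure links; depth-1 states fail to the root.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for b, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and b not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(b, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(data, automaton):
    """Return (byte_offset, pattern) for every match of any pattern in data."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, b in enumerate(data):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


def find_all(haystack, needles, form="NFC"):
    """Normalize text and needles to the same form, then search the UTF-8 bytes."""
    def norm(s):
        return unicodedata.normalize(form, s).encode("utf-8")
    return search(norm(haystack), build_automaton([norm(n) for n in needles]))


# "e" + combining acute accent vs. precomposed "é": the raw bytes differ,
# so the unnormalized search misses, while the normalized one matches.
decomposed = "cafe\u0301 au lait"
precomposed = "caf\u00e9"
print(search(decomposed.encode("utf-8"),
             build_automaton([precomposed.encode("utf-8")])))
print(find_all(decomposed, [precomposed]))
```

	Note that the offsets are byte offsets into the normalized UTF-8, not character positions in the original string. Because UTF-8 is self-synchronizing, a match on a valid encoded needle cannot start or end in the middle of another character's encoding, which is why the byte-level automaton needs no changes.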