HN Mirror



	by devonkim 1410 days ago
	I'm not intimately familiar with how text search indexing works, but in most Romanized / Latin-script text, determiners, articles, etc. are space-separated from the nouns. In Korean they are not, which can be confusing and introduces state into queries, because you have to split within character sequences. This isn’t the same thing as finding the roots of words (stemming) for fuzzy search purposes either. “짬뽕이 맛있습니다” has a plain noun 짬뽕 with case marking via -이, and the sentence-final predicate is parseable as a run-on phrase. But Finnish has case marking without space separation too, and it doesn’t seem to be cited as a parse / representation problem, last I saw. In English it’s “the 짬뽕 is delicious”, where the noun is obvious: split by spaces and you can quickly throw away “the” and “is”. In the Korean, the noun isn’t clear until you check for the case marker and the glyphs before it.

	Now, where I think there can be issues is in Unicode representations, where multiple code point sequences can wind up rendering as the same symbol.
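To make both points concrete, here is a rough Python sketch. It is not how a real search engine tokenizes Korean; the `PARTICLES` list and the helpers `strip_particle` and `same_text` are made up for illustration, and a real system would use a morphological analyzer instead of suffix matching. It only uses `unicodedata.normalize` from the standard library.

```python
import unicodedata

# Hypothetical, deliberately tiny particle list (assumption for illustration).
PARTICLES = ("이", "가", "은", "는", "을", "를")
ENGLISH_STOPWORDS = {"the", "is"}


def strip_particle(token: str) -> str:
    """Naively drop one trailing particle.

    Normalizes to NFC first so that precomposed and decomposed Hangul
    are compared in the same form. This will also clip nouns that
    happen to end in one of these syllables.
    """
    token = unicodedata.normalize("NFC", token)
    for particle in PARTICLES:
        if len(token) > len(particle) and token.endswith(particle):
            return token[: -len(particle)]
    return token


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# English: splitting on spaces and dropping stopwords exposes the noun.
english_terms = [
    w for w in "the 짬뽕 is delicious".split() if w not in ENGLISH_STOPWORDS
]

# Korean: splitting on spaces leaves the case marker attached to the noun.
korean_tokens = "짬뽕이 맛있습니다".split()
korean_terms = [strip_particle(t) for t in korean_tokens]

# Unicode: the same visible text as 2 precomposed syllables or 6 jamo.
precomposed = "짬뽕"
decomposed = unicodedata.normalize("NFD", precomposed)

print(english_terms)
print(korean_tokens, korean_terms)
print(len(precomposed), len(decomposed), precomposed == decomposed)
print(same_text(precomposed, decomposed))
```

The last part is the representation issue: a raw string comparison between the precomposed and decomposed forms fails even though they look identical, so an index would want to normalize both documents and queries to one form before doing anything else.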