| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eric4smith 1407 days ago

PostgreSQL text search is awesome - for English and Roman type languages.

But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

PostgreSQL is weird about joins. Joining on certain columns could be super fast but others dog slow. And this can flip depending on size of table and this index Cardinality.

That’s why it’s important on databases that grow quickly to check the performance of even simple queries as those can balloon in execution time as the profile of the data changes.

3 comments

asah 1407 days ago

Support for i18n: https://www.google.com/search?q=pgroonga

link

eric4smith 1406 days ago

Yeah we use it.

It’s however a database within a database. PostgreSQL does not natively text search on these kinds of languages.

link

jxf 1407 days ago

> But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

I understand that there are likely to be many significant differences in graphemes in these languages from Roman text but I'm not familiar enough with any Asian language to construct an example. Can you give an illustrative example that explains why the search doesn't work as well (or at all) in those cases?

link

devonkim 1406 days ago

Not intimately familiar with how text search works in indexing but in most Romanized / Latin script text determiners, articles, etc. are space separated from the nouns which can be confusing and introduce state into queries due to the need to perform some splitting within character sequences. This isn’t the same thing as finding the roots of words / stemming for fuzzy search purposes either. “짬뽕이 맛있습니다” has a plain noun 짬뽕 with case marking via -이 and the ending copula is parseable as a run on phrase but Finnish has case marking without space separation too and doesn’t seem to be cited as a parse / representation problem last I saw. In English it’s “the 짬뽕 is delicious” where noun is obvious and if you split by spaces you can quickly throw away “the” and “is” while it’s not clear in the Korean until you check for the case marker and prior glyphs for a parse. Now, where I think there can be issues is in Unicode glyph representations where multiple codes can wind up to the same symbol.

link

wahnfrieden 1407 days ago

no spaces in japanese, it's tricky to determine word boundaries and conjugations and multiple "spellings" of the same word. the libs that tokenize and de-inflect languages are usually highly specialized technologies for east asian languages (maybe others but i'm particularly familiar with japanese, korean, chinese)

link

eric4smith 1406 days ago

No spaces in Thai also. Difficult to discern word boundaries.

link

SnowHill9902 1407 days ago

Can you give more information about your JOINs performance? It’s mostly dependent on the presence and optimization of INDEXes and cardinality as you mention. JOIN ON indexed integers is usually fastest.

link

eric4smith 1406 days ago

It’s not such an easy thing to describe.

Yes we join on indexed integers too.

The thing is, we believe as the amount of data changes the planner will give different instructions. Even after lots of analyzing and vacuuming.

It’s happened so many times and so randomly and because of growth we really have to re-examine joins that were very fast 6 months ago that are now really slow.

link