by lifthrasiir 4276 days ago
The 60% threshold for the single-language scripts seems to be way low for CJK languages. And your method to calculate the occurrence ratio is flawed.

CJK scripts and languages tend to be more concise (in terms of # of Unicode codepoints) than many other languages, so the ratio of CJK to non-CJK characters in a text can be lower than average. And the occurrence ratio is currently calculated over the number of all characters, including non-letters, which makes the ratio much lower. Maybe a custom threshold per script based on an actual corpus (90th percentile, maybe?) and a better occurrence calculation would improve detection for those languages.
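To make the per-script threshold idea concrete, here is a rough sketch in Python. It is not Franc's code. It reads "90th percentile" as "the threshold that 90% of known documents in that script would clear", which is only one possible interpretation. The function name `threshold_for_coverage` and the sample ratios are made up for illustration.

```python
def threshold_for_coverage(ratios, coverage_percent=90):
    """Return the highest threshold that at least coverage_percent
    of the given script ratios still meet (nearest-rank method)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    ordered = sorted(ratios, reverse=True)
    # Integer ceiling division avoids floating-point rounding surprises.
    k = (coverage_percent * len(ordered) + 99) // 100
    k = max(1, k)
    return ordered[k - 1]


# Hypothetical per-document script ratios, NOT measured from a real corpus.
corpus_ratios = {
    "hangul": [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.3],
}

thresholds = {
    script: threshold_for_coverage(ratios)
    for script, ratios in corpus_ratios.items()
}
print(thresholds)
```

With these placeholder numbers, nine of the ten documents have a ratio of at least 0.55, so 0.55 becomes the threshold for that script. A real threshold would have to come from ratios measured on actual text.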


I’m not sure. I don’t know any CJK languages myself. I’d like some test-cases where the current methods do not work, as the example in the Readme seems to work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?
Some examples follow. I only tested with arbitrary text from the Web, and I agree that they are somewhat marginal examples. (But I do think that Franc's margin for CJK languages is way wide.)

한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%), 스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전 세계 69억여 명의 인구 중 약 1%에 해당한다.

This text from Korean Wikipedia is about the share of Korean documents among all documents on the Internet. The digits distort the overall ratio, and Franc doesn't return any candidates (not even "und").
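A minimal Python sketch of the effect, not Franc's actual implementation: it compares the Hangul ratio over all characters with the ratio over letters only. The helper names are hypothetical, and `is_hangul` only checks the precomposed Hangul syllable block (U+AC00 to U+D7A3), which is a simplification.

```python
def is_hangul(ch: str) -> bool:
    """True for precomposed Hangul syllables (U+AC00..U+D7A3) only."""
    return "\uac00" <= ch <= "\ud7a3"


def hangul_ratio(text: str, letters_only: bool) -> float:
    """Share of Hangul among all characters, or among letters only."""
    if letters_only:
        # str.isalpha() is true for Unicode letters and false for
        # digits, whitespace and punctuation.
        pool = [ch for ch in text if ch.isalpha()]
    else:
        pool = list(text)
    if not pool:
        return 0.0
    return sum(is_hangul(ch) for ch in pool) / len(pool)


sample = "2004년에 4.1%"  # a fragment of the Korean example above
print(hangul_ratio(sample, letters_only=False))  # 2 Hangul out of 11 characters
print(hangul_ratio(sample, letters_only=True))   # 2 Hangul out of 2 letters
```

Counting over all characters puts this fragment far below a 60% threshold, while counting over letters only gives a ratio of 1.0.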

現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book" はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library" は前置詞とともに付け加えられた修飾語と考えられる。

This text from Japanese Wikipedia concerns the distinction between objects and complements in English syntax. In this bilingual text, Japanese looks as though it has reached the 60% threshold, but by codepoint count it has not.

I pushed a fix, incorporating your suggestions, and your examples in the specs.

Thanks a lot!

Oh you’re right. I think I have a fix in mind, will work on it. Thanks so much!