by lifthrasiir
4276 days ago
The 60% threshold for single-language scripts seems far too low for CJK languages, and the way you calculate the occurrence ratio is flawed. CJK scripts and languages tend to be more concise (in terms of the number of Unicode code points) than many other languages, so the ratio of CJK script to non-CJK script can come out lower than average. On top of that, the occurrence ratio is currently calculated over all characters, including non-letters, which pushes the ratio lower still. Maybe a custom threshold per script, derived from an actual corpus (90th percentile, maybe?), together with a better occurrence calculation would improve detection for those languages.
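To make the point about the denominator concrete, here is a minimal Python sketch, not the project's actual code. The names `is_cjk`, `occurrence_ratio`, and the name-prefix list are hypothetical, and treating a letter as CJK when its Unicode name starts with one of those prefixes is a simplifying assumption (a real detector would more likely use the Unicode Script property).

```python
import unicodedata

# Assumption: a letter counts as "CJK" if its Unicode character name starts
# with one of these prefixes. This is a rough stand-in for the Script property.
CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_letter(ch: str) -> bool:
    """True for any character in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_cjk(ch: str) -> bool:
    """True for letters whose Unicode name matches one of the CJK prefixes."""
    return is_letter(ch) and unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES)


def occurrence_ratio(text: str, letters_only: bool = True) -> float:
    """Share of CJK letters in `text`.

    letters_only=False divides by every character (spaces, digits and
    punctuation included); letters_only=True divides by letters only.
    """
    denominator = sum(1 for ch in text if is_letter(ch)) if letters_only else len(text)
    if denominator == 0:
        return 0.0
    cjk_count = sum(1 for ch in text if is_cjk(ch))
    return cjk_count / denominator
```

For the sample string `"漢字 abc 123!"`, which has 11 characters, 5 letters, and 2 CJK letters, the all-characters ratio is 2/11 and the letters-only ratio is 2/5, so the same text lands much closer to any fixed threshold once non-letters are left out of the denominator.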