by TallGuyShort
6169 days ago
It depends heavily on where the string comes from, since that determines which special cases you need to handle. Can you provide more details?

Edit: Based on what you said in your original post, I would keep a list of possible delimiters (which you will probably need to add to for some time) and tokenize the string on those. Then discard any token that appears in a second list of words that don't matter (conjunctions, articles, prepositions, etc.). Before discarding a token, also check whether it is an operator used in your app, or anything like that.
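A rough sketch of that idea in Python, in case it helps. The delimiter list, the stop-word list, and the operator set are all made-up placeholders (nothing here comes from your app), and the function name `tokenize` is just for illustration:

```python
import re

# Assumed example lists; a real app would grow these over time.
DELIMITERS = [",", ".", ";", ":", "!", "?", " ", "\t", "\n"]
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "in", "with"}
# Hypothetical operators that the app treats as meaningful.
OPERATORS = {"and", "or"}


def tokenize(text, delimiters=DELIMITERS, stop_words=STOP_WORDS, operators=OPERATORS):
    """Split text on the delimiters, then drop stop words that are not operators."""
    # Escape each delimiter so characters like "." and "?" are matched literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    # Splitting on adjacent delimiters yields empty strings; filter them out.
    tokens = [t for t in re.split(pattern, text) if t]

    kept = []
    for token in tokens:
        lowered = token.lower()
        # Check for operators before discarding anything as a stop word.
        if lowered in stop_words and lowered not in operators:
            continue
        kept.append(token)
    return kept
```

With these placeholder lists, `tokenize("the ham and egg")` keeps "and" because it is listed as an operator, while passing `operators=set()` drops it along with "the".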
|
Basically, I'm thinking of a string like "ham, egg." which should result in "ham" and "egg", while "Ветчина, яйцо." should likewise result in "Ветчина" and "яйцо".
The challenge is that you cannot whitelist all possible characters, as there are (imho) too many charsets.
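One possible way around whitelisting characters, sketched here in Python as an assumption rather than a recommendation for any particular stack: in Python 3, `\w` in a `str` pattern is Unicode-aware, so it matches letters from most scripts without listing them. The name `split_words` is made up for this example.

```python
import re

# For str patterns in Python 3, \w matches Unicode word characters
# (letters and digits from many scripts, plus the underscore).
WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Return the runs of word characters found in text, in order."""
    return WORD_RE.findall(text)
```

Here `split_words("ham, egg.")` gives "ham" and "egg", and `split_words("Ветчина, яйцо.")` gives "Ветчина" and "яйцо". Note that this treats digits and underscores as part of words, and it will not segment languages that do not separate words with spaces or punctuation, so it is only a starting point.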