Ask HN: How do you split strings (to get keywords)?
2 points
by gopher
6169 days ago
First try: split on whitespace, but this sucks with punctuation and special characters. Second try: use an alphanumeric whitelist and split on anything else, but what about umlauts? What about Hebrew or Cyrillic? Third try: split on characters < 32, whitespace, and punctuation characters; this works somewhat but is ugly. What would you do to get keywords from a string?
edit: Based on what you said in your original post, I would keep a list of possible delimiters (which you would probably need to keep extending for some time) and tokenize the string on those. Then discard any token that appears in a second list of words that don't matter (conjunctions, articles, prepositions, etc.). Before discarding a token, you'd also want to check whether it is an operator used in your app, or anything like that.
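
A rough sketch of that idea in Python, in case it helps. The delimiter list, the stop-word list, the operator set, and the function name `extract_keywords` are all made up for illustration; your app would need its own. Because it splits only on listed delimiters rather than whitelisting letters, umlauts, Hebrew, and Cyrillic pass through untouched.

```python
import re

# Hypothetical lists for illustration; a real app would grow these over time.
DELIMITERS = " \t\n\r.,;:!?\"'()[]{}<>/\\|-"
STOP_WORDS = {"a", "an", "and", "the", "or", "of", "in", "on", "to", "is"}
OPERATORS = {"AND", "OR", "NOT"}  # assumed app-specific search operators


def extract_keywords(text):
    """Return (keywords, operators) found in text."""
    # Build a character class from the delimiter list; re.escape keeps
    # characters such as ']' and '-' literal inside the class.
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    tokens = [t for t in re.split(pattern, text) if t]

    keywords, operators = [], []
    for token in tokens:
        # Check for operators first, so "AND" is not thrown away
        # as the stop word "and".
        if token in OPERATORS:
            operators.append(token)
        elif token.casefold() not in STOP_WORDS:
            keywords.append(token)
    return keywords, operators


print(extract_keywords("The quick-brown fox AND the dog, in Köln!"))
# (['quick', 'brown', 'fox', 'dog', 'Köln'], ['AND'])
```

The operator check is case-sensitive here while the stop-word check is not; that is just one possible convention, and it only works if users type operators in upper case.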