by tivert
2736 days ago
> I created a Telegram group with a bot that censors the most common names and expressions on Brazilian partisan politics, using regular expressions.
How did you handle simple substitutions and noise like rethuglicans -> rethug1icans or rethuglicans -> re.thug.licans?
It will not handle some of them. But I discovered that partisan politics follows a Pareto rule of sorts: 80% of the talk revolves around a small set of words. If you remove the right 80%, what remains is very ineffective, grotesque and pathetic communication. It is not enough to get people excited or willing to fight.
The tricky part is to keep changing the set of words and regular expressions. In the months before an election in particular, the terms to filter change intensely. After that they remain very stable.
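To make the idea concrete, here is a minimal sketch of a term filter whose word list is easy to swap out. It is not the bot's actual code: `BLOCKED_TERMS`, `build_pattern` and `censor` are hypothetical names, and the terms are placeholders.

```python
import re

# Hypothetical term list. The real one would be curated by hand and
# revised often, especially in the months before an election.
BLOCKED_TERMS = ["rethuglicans", "exampleterm"]


def build_pattern(terms):
    """Compile one case-insensitive regex matching any term as a whole word."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def censor(message, pattern, mask="***"):
    """Replace every blocked term found in the message with the mask."""
    return pattern.sub(mask, message)


pattern = build_pattern(BLOCKED_TERMS)
print(censor("Those Rethuglicans again", pattern))
print(censor("Those rethug1icans again", pattern))
```

The first message gets masked, the second passes through untouched, which is exactly the kind of substitution a plain pattern misses. Rebuilding the pattern from the list keeps updates cheap when the vocabulary shifts.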
Edit: I am now trying to use the Levenshtein distance[1] algorithm to preemptively detect the tricks you describe, where people deliberately alter a word in order to fool the regular expressions.
[1] https://en.wikipedia.org/wiki/Levenshtein_distance
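For illustration, a rough sketch of how an edit-distance check could catch those variants. This is an assumption about the approach, not the actual bot code: `levenshtein` is the textbook dynamic-programming version, while `is_near_match` and its `max_distance` threshold are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def is_near_match(word, terms, max_distance=2):
    """True if the word is within max_distance edits of any blocked term."""
    word = word.lower()
    return any(levenshtein(word, term) <= max_distance for term in terms)


terms = ["rethuglicans"]  # placeholder list
print(levenshtein("rethuglicans", "rethug1icans"))    # 1
print(levenshtein("rethuglicans", "re.thug.licans"))  # 2
print(is_near_match("Rethug1icans", terms))           # True
```

The threshold is a trade-off: set it too high and ordinary words that happen to sit close to a blocked term get flagged too, so it would need tuning against real messages.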