by jobigoud 2781 days ago
The original text probably used ligatures for fi and fl, and they got lost in conversion.

https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic...


Yup. I have to manually detect and correct every possible Unicode ligature in my text-to-speech pre-processor scripts. I hate them.
If you have a Unicode library available, you might try asking it to convert the text to NFKD or NFKC normalization form. This will take apart ligatures (the former will also take apart accented characters).
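As a rough illustration of the normalization suggestion above, here is a minimal Python sketch using the standard unicodedata module. The helper name expand_ligatures and the sample string are made up for this example; which form is right depends on what the rest of the pipeline expects.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    # Hypothetical helper: NFKC replaces compatibility characters such as
    # the "ffi" (U+FB03) and "fl" (U+FB02) ligatures with plain letters,
    # then recomposes accented characters.
    return unicodedata.normalize("NFKC", text)

# Sample text written with ligature code points (invented for illustration).
sample = "e\ufb03cient \ufb02ow"
print(expand_ligatures(sample))

# NFKD also splits accented characters into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print(len(decomposed))
```

Note that characters Unicode treats as letters in their own right, such as æ and œ, have no compatibility decomposition, so they would still need manual handling.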
"this gives us efficient space-time trade-offs" :-(
Those are HTML entities. Most modern programming languages come with tools to decode them, e.g. in Python:

    text = urllib.parse.unquote(text)
urllib.parse.unquote() is unrelated to HTML. It undoes URL-encoding:

https://docs.python.org/3/library/urllib.parse.html#urllib.p...

In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:

https://docs.python.org/3/library/html.html#html.unescape
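To make the difference concrete, here is a small sketch contrasting the two functions. The input strings are invented for illustration and are not the text from the original article.

```python
import html
import urllib.parse

# Hypothetical input containing HTML entities (numeric and named).
escaped_html = "space&#x2D;time trade&#45;offs &amp; more"
print(html.unescape(escaped_html))

# unquote() only understands %XX URL escapes, so entities pass through untouched.
print(urllib.parse.unquote(escaped_html))

# Hypothetical URL-encoded input, which is what unquote() is meant for.
print(urllib.parse.unquote("space%2Dtime"))
```

Running each decoder on the other's kind of input leaves the text unchanged, which is why the mix-up tends to go unnoticed until the entities show up in the output.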

You are 100% correct. I mixed the two encodings up. Thanks.