| HN Mirror


	by jobigoud 2781 days ago
	Probably the original text was using ligatures for fi and fl and they got lost in conversion. https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic...

superkuh 2781 days ago

Yup. I have to manually detect and correct every possible Unicode ligature in my text-to-speech pre-processor scripts. I hate them.

dunham 2781 days ago

If you have a Unicode library available, you might try asking it to convert the text to NFKD or NFKC normalization form. This will take apart ligatures (the former will also take apart accented characters).
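
As a rough illustration of that suggestion, here is a minimal sketch using Python's standard unicodedata module. The sample strings and variable names are made up for the example; U+FB03, U+FB00 and U+FB01 are the "ffi", "ff" and "fi" ligature code points.

```python
import unicodedata

# Hypothetical sample strings: U+FB03 is the "ffi" ligature,
# U+FB00 is "ff", U+FB01 is "fi", U+00E9 is a precomposed "é".
ligatured = "this gives us e\ufb03cient space-time trade-o\ufb00s"
accented = "caf\u00e9 \ufb01le"

# NFKC splits the ligatures into plain letters and keeps accents composed.
nfkc_ligatured = unicodedata.normalize("NFKC", ligatured)
nfkc_accented = unicodedata.normalize("NFKC", accented)

# NFKD also splits the ligatures, and additionally breaks "é" into
# "e" followed by the combining acute accent U+0301.
nfkd_accented = unicodedata.normalize("NFKD", accented)

print(nfkc_ligatured)
print(len(nfkc_accented), len(nfkd_accented))
```

If the accents should survive as single characters, NFKC is probably the safer choice of the two. Both are compatibility forms, so they may also rewrite other characters (superscripts, full-width letters and so on), not just ligatures.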

ahazred8ta 2781 days ago

"this gives us eﬃcient space-time trade-oﬀs" :-(

dotancohen 2781 days ago

Those are HTML entities. Most modern programming languages come with tools to decode them, e.g. in Python:

    text = urllib.parse.unquote(text)

jwilk 2781 days ago

urllib.parse.unquote() is unrelated to HTML. It undoes URL-encoding:

https://docs.python.org/3/library/urllib.parse.html#urllib.p...

In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:

https://docs.python.org/3/library/html.html#html.unescape
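
To make the difference concrete, here is a small sketch that runs both functions on both kinds of input. The sample strings and variable names are invented for illustration; only html.unescape() and urllib.parse.unquote() come from the standard library.

```python
import html
from urllib.parse import unquote

# Hypothetical inputs: one with HTML entities, one URL-encoded.
entity_text = "e&#xFB03;cient &amp; fast"
url_text = "space-time%20trade-offs"

# html.unescape() decodes numeric and named HTML entities.
decoded_entities = html.unescape(entity_text)

# unquote() decodes %XX escapes from URL-encoding.
decoded_url = unquote(url_text)

# Each function leaves the other encoding untouched.
unquote_on_entities = unquote(entity_text)
unescape_on_url = html.unescape(url_text)

print(decoded_entities)
print(decoded_url)
```

Note that the decoded entity here is still the single "ffi" ligature character (U+FB03), so a ligature clean-up step would still be needed afterwards.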

dotancohen 2780 days ago

You are 100% correct. I mixed the two encodings up. Thanks.
