by TipVFL 2779 days ago
This reminds me of an eBook of Neuromancer that I read; it was occasionally missing the letter f. For the most part I just added it back mentally without really thinking about it, but then sometimes I hit a passage like this: "He turned, pulled his jacket on, and licked the cobra to full extension." That one took a moment.
3 comments

Probably the original text was using ligatures for fi and fl and they got lost in conversion.

https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic...

Yup. I have to manually detect and correct for every possible ligature in Unicode in my text-to-speech pre-processor scripts. I hate them.
If you have a Unicode library available, you might try asking it to convert the text to NFKD or NFKC normalization form. This will take apart ligatures (the former will also take apart accented characters).
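A minimal sketch of that suggestion, assuming Python and its standard unicodedata module; the sample string is made up for illustration. Note that the compatibility forms also rewrite other characters (superscripts, full-width letters and so on), so this may change more than just ligatures.

```python
import unicodedata

# Hypothetical sample: U+FB03 is the "ffi" ligature, U+FB00 is "ff",
# and U+00E9 is a precomposed e-acute.
text = "e\ufb03cient trade-o\ufb00s, caf\u00e9"

# NFKC takes ligatures apart but keeps accented letters composed.
nfkc = unicodedata.normalize("NFKC", text)

# NFKD takes ligatures apart and also splits accented letters
# into a base letter plus a combining mark (e + U+0301).
nfkd = unicodedata.normalize("NFKD", text)

print(nfkc)
print(len(nfkc), len(nfkd))  # NFKD is one code point longer here
```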
"this gives us efficient space-time trade-offs" :-(
Those are HTML entities. Most modern programming languages come with tools to decode these, e.g. in Python:

    text = urllib.parse.unquote(text)
urllib.parse.unquote() is unrelated to HTML. It undoes URL-encoding:

https://docs.python.org/3/library/urllib.parse.html#urllib.p...

In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:

https://docs.python.org/3/library/html.html#html.unescape
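To make the difference concrete, here is a small sketch. The input string is an assumed example (the entities from the quoted text are not preserved here); it uses numeric character references for the "ffi" and "ff" ligatures.

```python
import html
import unicodedata
import urllib.parse

# Assumed example input: &#64259; is U+FB03 (ffi), &#64256; is U+FB00 (ff).
raw = "e&#64259;cient space-time trade-o&#64256;s"

# html.unescape() turns the entities into the ligature characters...
decoded = html.unescape(raw)

# ...and NFKC normalization then expands those ligatures to plain letters.
plain = unicodedata.normalize("NFKC", decoded)

# urllib.parse.unquote() only handles %XX URL escapes, so it leaves
# HTML entities untouched.
untouched = urllib.parse.unquote(raw)
url_decoded = urllib.parse.unquote("space%2Dtime%20trade%2Doffs")

print(plain)
print(untouched == raw)
print(url_decoded)
```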

You are 100% correct. I mixed the two encodings up. Thanks.
I wonder if at some point your e-book went through macOS's Preview program.

At work I sometimes have to copy blocks of text from a PDF into another document. If I do it with Preview, I lose the fi and fl ligatures. It only happens with PDFs created in-house, so I guess it's some kind of stylistic thing that comes from the guy who lays out the PDFs.

I eventually learned to use Adobe's own Acrobat, instead, and it works fine.

Please send to bugreport.apple.com
I'd argue this is a feature and not a bug. When copy-pasting text from PDFs, I'd love to not have to deal with Unicode and ligatures. There's another comment upthread here where someone's complaining about having to deal with Unicode.

If Preview can do this automatically, please don't change that feature.

I think the GP commenter meant that the ligatures are converted lossily into an arbitrary substitute character (e.g. fl -> l), rather than that they’re taken apart losslessly.
For clarity, I was describing how in Preview fl -> NULL.

Preview for some reason just drops it entirely.

In Acrobat, fl -> f and l adjacent.
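The two behaviours can be imitated in a few lines of Python. This is only an illustration of the observed results, not a claim about how Preview or Acrobat work internally; the function names and the list of ligatures are made up for the example.

```python
import unicodedata

# The common Latin ligature code points: ff, fi, fl, ffi, ffl.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04"

def drop_ligatures(text: str) -> str:
    """Lossy: delete ligature characters entirely (fl -> nothing)."""
    return "".join(ch for ch in text if ch not in LIGATURES)

def expand_ligatures(text: str) -> str:
    """Lossless for the letters: fl ligature -> 'f' followed by 'l'."""
    return unicodedata.normalize("NFKC", text)

sample = "\ufb02icked the cobra"  # starts with the U+FB02 "fl" ligature

print(drop_ligatures(sample))
print(expand_ligatures(sample))
```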

Not gonna lie, I still can't parse what this is supposed to be after a number of readings. What's the actual sentence? I'm so curious :P
> licked the cobra to full extension

should be

> flicked the cobra to full extension

The cobra is a weapon in the Neuromancer universe, something like an extendable knife/club.