Hacker News
Converting Markdown to ePub or Mobi Using Pandoc (themythicalengineer.com)
128 points by sks147 1880 days ago
15 comments

Thanks for the nice tutorial OP.

Here's a public notebook in Deepnote if anyone wants to play around with the code or duplicate it: https://deepnote.com/project/Converting-Markdown-to-Epub-or-...

2 fun facts about Deepnote:

1. You can create a custom environment by writing a Dockerfile with all the libraries you need, and every time you need to reuse similar functionality (e.g. convert yet another book to mobi), you can just fire it up and everything will be preinstalled. https://docs.deepnote.com/environment/custom-environments

2. You can turn any notebook into a blog post right away and publish it directly within Deepnote.

Disclaimer: I'm a software engineer at Deepnote.

Wait, this is a notebook similar to Python's notebooks, but it's a Docker environment where I can install a lot of stuff I want and then do even more stuff? Am I getting this right?

It's like a shell to a VM, but in a notebook format that you can then use to blog?

Yeah, you got it right. You can even access the actual shell in the vm, not just the notebook environment.
Awesome! I will definitely give it a try
What would be the advantages of moving to Deepnote from a regular Jupyter notebook-based workflow?

Let's assume someone who has been working with Jupyter notebooks (mostly Python-based) for a long time.

Are Deepnote notebooks exportable?

The big worry is that you guys decide to pivot or radically change your pricing model and there is no offramp.

By comparison, I don't mind using Google Colab. If Google Colab decides to shut down or 100x their price, I can take my .ipynb files and use them on my local Littlest JupyterHub instance.

Deepnote internally supports the .ipynb format, and you can always export a Deepnote notebook to .ipynb, just as you would in Colab.

In general, the main selling points are live collaboration (you can work on a notebook with your team as you would in a Google Doc) and integrations (you can plug in your Snowflake DB, S3 bucket, or whatever, and have it connected for further analysis, long-term training, etc.).

For many non-software-developer data scientists, it's also easier to work in a cloud environment compared to installing stuff locally, and to version their notebooks in Deepnote instead of git. But this really depends on the particular workflow that one has.

Thank you for the answers!

I can absolutely see a need for a collaboration tool. Collaboration on regular Jupyter is a pain. I create a shared folder for coworkers and, well, read/write permissions* are not fun.

* knows chmod - https://www.reddit.com/r/linux/comments/dily0/i_know_how_to_...

Thanks for publishing this as a notebook. I really love the platform.
I followed a similar approach for my novel; started with Markdown, used pandoc to convert it to epub/mobi, but also to LibreOffice .odt to generate the PDF for the paperback. Wrote some details about the process here: https://gabrielgambetta.com/tgs-open-source.html
That's how I wrote and self-published my book as well! Although, I created a script that turns md to epub/mobi/pdf using pandoc.

Here's how I did it in case anyone is interested: https://pascalprecht.github.io/posts/writing-an-ebook
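
For anyone who wants a starting point before reading the linked posts, a build script along these lines could cover the three formats mentioned above. This is only a rough sketch and not either commenter's actual script: `build_book`, `run`, and `DRY_RUN` are names made up here, `book.md` and `book` are placeholder file names, and it assumes `pandoc` and Calibre's `ebook-convert` are on the PATH. By default pandoc also needs a LaTeX engine installed for PDF output.

```shell
#!/bin/sh
# Hypothetical build helper: Markdown -> EPUB -> MOBI, plus PDF.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_book() {
    src=$1   # Markdown source, e.g. book.md
    out=$2   # output base name, e.g. book
    run pandoc "$src" -o "$out.epub"            # pandoc writes EPUB directly
    run ebook-convert "$out.epub" "$out.mobi"   # Calibre converts EPUB to MOBI
    run pandoc "$src" -o "$out.pdf"             # PDF via a LaTeX engine by default
}
```

Calling `build_book book.md book` with `DRY_RUN=1` set only prints the three commands, which is a cheap way to check the script before running it for real.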

This is great. Thanks for sharing
Don't want to be a troll, but if you are writing anything that is not a README and/or is a book or booklet or bookish, do yourself a favor and use Asciidoc instead.
This! Asciidoc is the grown-up brother of Markdown. Designed to scale up to entire books, handle images, tables, references, citation, book indexes, maths.

And the syntax is very friendly and intuitive.

And it's quite easy to insert compiled images, e.g. Graphviz, UML diagrams, Ditaa, just by having the SOURCE in your document.
Absolutely. AsciiDoc (2002) is actually 2 years older than Markdown (2004), but is surprisingly similar to write.

It was created as an equivalent to DocBook XML for the creation of book-length technical documents. It has a rich history and is well supported in many places (try writing a README.adoc for your next GitHub-hosted repo).

It is also currently undergoing a standardization process:

https://projects.eclipse.org/proposals/asciidoc-language

What's your suggested Asciidoc toolchain?

(Pandoc will ingest Asciidoc as well as Markdown, FWIW.)

At work, we have been using Asciidoc for 10 years or so, so we use Asciidoc -> XSLT -> PDF / chunked HTML. Works great, not nice to use.

For new projects, we use Asciidoctor - it does everything but chunked HTML, and it's a pleasure to use.

Thanks.
Give me a reason to use asciidoc rather than bookdown.
I assume the person you're replying to was referring to Markdown rather than Bookdown. It seems that Asciidoc was designed for the direction in which Markdown has been going with all these variants. If you find yourself chasing these Markdown variants to get more flexibility, then Asciidoc might be what you're looking for.

I don't know anything about Bookdown and it may be similar to Asciidoc. I would be willing to bet that Asciidoc would be around longer than most of the MD variants though.

Why might you want to continue using MD flavors? You already have loads of MD docs and you have no control over the processes which create them (shared environment, higher-ups force you to use MD, etc.)

NOTE: If you're interested in Asciidoc, also take a look at Asciidoctor.

Thanks. That's basically what I figured.

One feature I wanted that made me go with bookdown is that the static website it creates has search built in.

Pandoc has been one of the best tools I have used and this blogpost is well written
This is a nice tutorial, thanks for submitting it! However, for me, the biggest discovery was epub.press [0]. I just tried it on a couple of open pages, and it works quite well!

[0] - https://epub.press/#about

Why wget|dpkg and wget|sh instead of apt to download Pandoc and Calibre?

You should be able to replace all this:

    !wget https://github.com/jgm/pandoc/releases/download/2.11.3.2/pandoc-2.11.3.2-1-amd64.deb
    !sudo dpkg -i pandoc-2.11.3.2-1-amd64.deb
    !apt install libgl1-mesa-glx -y
    !wget -q -O- https://download.calibre-ebook.com/linux-installer.sh | sudo sh /dev/stdin
With simply this:

    !apt install -y pandoc calibre
The Calibre website strongly recommends downloading from their site instead of using OS packages, mentioning that the packages are often out of date. And I've generally found this to be true: Calibre versions in package repos are often several versions behind, more than the usual "package maintainer trying to play catch up" differences.

I'm usually averse to the wget|sh installs, but in this case it seems worth it. You can inspect the .sh file (which is really mostly Python code) before running it, just to not get into the bad habit of directly piping in code from the internet.

> You can inspect the .sh file

That's not the issue. Installing software this way means you have no automatic updates in the future. It's fine if you re-run the install script on a regular basis (e.g. by recreating containers) AND you don't pin versions.

But OP's instructions fail both: there is no mention of updates, and they pin the pandoc version.

I use pandoc in a CD pipeline; the version in the repos is stale compared to upstream (normal, that's how it is) unless you're on a rolling distro like Arch.

I have reported pandoc bugs and had them fixed (great dev team). Pulling the latest single-DEB install (no deps, unlike the one in the Debian repo) gets all the latest updates, which matter to a process like this.

In this particular case, your need to use the latest pandoc leads to the wget pull and install, which thanks to their DEB design is easy and clean to do in an ephemeral CI container.

Do you have any more details about how you integrated Pandoc into your pipeline? A post or something?
Sure thing, it's simple enough that I can post it right here. In your CI/CD runner, you add a "before" script like so (GitLab YAML example):

    image: debian:latest

    before_script:
        - bash myscript.sh
Your myscript.sh can be as simple as four lines (one to install curl, it's not a default on Debian), example:

    apt-get update && apt-get -y install curl
    VERSION=$(curl -s "https://api.github.com/repos/jgm/pandoc/releases/latest" | grep -Po '"tag_name": "\K.*?(?=")')
    curl -sLo "pandoc-${VERSION}-1-amd64.deb" "https://github.com/jgm/pandoc/releases/download/${VERSION}/pandoc-${VERSION}-1-amd64.deb"
    apt-get -y install "./pandoc-${VERSION}-1-amd64.deb"
The GitHub API endpoint used above conveniently returns the latest release, and the grep pulls its `tag_name` out of the response. One could replace the grep with `jq` for more robust parsing, but this very simple setup works as a starting point for developing your own style.
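
As a rough sketch of the `jq` idea: the helper below reads the API response on stdin and prints `tag_name`, using `jq -r` when `jq` is installed and falling back to `sed` otherwise. The function names `latest_tag` and `deb_url` are made up for illustration, and the URL pattern is simply the one from the script above, so it assumes pandoc keeps naming its release assets that way.

```shell
#!/bin/sh
# Read a GitHub "latest release" JSON document on stdin and print its tag_name.
latest_tag() {
    if command -v jq >/dev/null 2>&1; then
        jq -r '.tag_name'
    else
        # Fallback without jq: print the first "tag_name": "..." value found.
        sed -n 's/.*"tag_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
    fi
}

# Build the download URL of the amd64 .deb for a given version (assumed naming).
deb_url() {
    echo "https://github.com/jgm/pandoc/releases/download/$1/pandoc-$1-1-amd64.deb"
}
```

In the script above, the `VERSION=` line would then become something like `VERSION=$(curl -s "https://api.github.com/repos/jgm/pandoc/releases/latest" | latest_tag)`.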
The tutorial is presented well. My biggest takeaway was that one can use 'Deepnote' to run Linux commands.

If you are interested in knowing how to customize `pandoc` for generating PDF/EPUB, I have a tutorial [0] based on books I've written. I also have links at the end with related resources, including tools other than `pandoc`.

[0] https://learnbyexample.github.io/customizing-pandoc/

Jupyter notebook also has the same feature. `!shell command` or start a cell with `%%bash` and everything in it will run through the terminal, not the notebook interpreter.
That'd require you to have access to a *nix terminal on your system. Deepnote is allowing access through their servers, so for example, you can try this tutorial on Windows.

See this comment from Deepnote engineer for more details: https://news.ycombinator.com/item?id=26899905

Note that mobi format is being deprecated by Amazon. If you're producing a file for distribution by Amazon, you only need the epub file.
However, Kindles can't read epub directly. If you're trying to produce a file you can sideload on your (or your customer's) Kindle without going through Amazon, you're still gonna need mobi, or one of the later proprietary (and undocumented) Amazon mobi variants.
True, but I wouldn't count on that being supported forever.
The day Kindles stop supporting side-loaded third-party content is the day I stop buying Kindles.
Is Pandoc being used mostly to join the files?

I recently started converting Markdown files to epub (and kepub) for my new Kobo. I load the Markdown straight into Calibre though.

On a side note, is there some benefit to mobi over epub? Kepub seems to be the preferred format on Kobo, because for some reason it turns pages _much_ faster than epub and gives access to reader statistics (if one cares about that).

Hey, I'm curious about your Calibre usage! I'm working on turning a written book to Markdown, and pandoc has a real pain point on links between chapters. Please tell me more!
What would you like to know? In Calibre you can set regex to indicate what should be used for chapter, sub chapter etc, and it can be used to generate the TOC. So I use Markdown headings, #, ##, ### etc for chapters and subsections.
Checking, it looks like Calibre expects one Markdown file as input, where I have a few Markdown files, linking to each other in a way that works on GitHub. It looks like the sort of thing that either works as-is with luck, or is a pain in the neck and needs massaging.
Ah, yes, I've only been doing this with single markdown pages. I believe people use pandoc for multi-page.

GitHub style linking doesn't really lend itself to "book" format so I suppose there's no auto way to do that.
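
For what it's worth, some of that massaging can be scripted. The sketch below is only one possible approach, not something from the thread: it rewrites cross-file links such as `[text](other.md#some-heading)` into in-document links `[text](#some-heading)`, so they can still resolve once pandoc has merged the chapters into one book. The name `localize_links` and the file names in the comments are hypothetical, and it assumes the heading anchors GitHub generates match the identifiers pandoc generates, which is worth checking on a real book.

```shell
#!/bin/sh
# Rewrite [text](file.md#anchor) into [text](#anchor), reading Markdown on stdin.
# Links without a #fragment (e.g. [next](chapter3.md)) are left untouched,
# because there is no anchor to point at.
localize_links() {
    sed 's/]([^)#]*\.md#\([^)]*\))/](#\1)/g'
}

# Possible usage (placeholder file names, requires pandoc):
#   mkdir -p build
#   for f in chapter1.md chapter2.md; do localize_links < "$f" > "build/$f"; done
#   pandoc build/chapter1.md build/chapter2.md -o book.epub
```

Pandoc accepts several input files and concatenates them in the order given, which is what makes the single-document anchors work.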

>On a side note, is there some benefit to mobi over epub?

AFAIK Amazon Kindle devices can read mobi files but not epub files (unless you convert them to something else first).

(The mobi format is older, so if you want to read an ebook on your old Palm Pilot PDA then you'll probably want mobi.)

Thanks for the heads up! As for my palm pilot, well, I'm not sure which drawer that's been in for the last decade or two.
Good stuff. When I wrote an eBook I found the extra features of reStructuredText to be useful (index, glossary, graphviz & Tikz environments, etc.) and wrote a sort of similar post.

https://digitalsuperpowers.com/blog/2019-02-16-publishing-eb...

Yeah, the reST markup is pretty idiosyncratic, both anal-retentive and a bit inconsistent (e.g. it can be hard to internalise where it does or does not want blank lines), but Sphinx is an actual document system; having to work with Markdown for anything beyond a single file quickly gets painful.
You know, I bet it wouldn't take that much to go from epub to PDF, suitable for printing and binding. The structural information is pretty much all there, I think - it'd just need pagination and formatting for print, really.

I'd definitely want to use such a thing, as a way to feed my bookbinding hobby. I wonder if anyone else would?

Is the resulting typesetting any good?
Epub is just html with css and images in a zip file. So the quality of the typography will largely depend on the renderer, not the file itself.
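
A quick way to see the zip-file nature for yourself is to look at the first bytes of the file. The check below is a heuristic sketch with a made-up function name, `looks_like_epub`; it relies on the EPUB container convention that the archive's first entry is an uncompressed file named `mimetype` containing `application/epub+zip`, which puts that string at byte offset 38.

```shell
#!/bin/sh
# Heuristic: an EPUB is a ZIP archive ("PK" signature) whose first entry is an
# uncompressed "mimetype" file, so its content starts at byte offset 38.
looks_like_epub() {
    magic=$(dd if="$1" bs=1 count=2 2>/dev/null)
    mime=$(dd if="$1" bs=1 skip=38 count=20 2>/dev/null)
    [ "$magic" = "PK" ] && [ "$mime" = "application/epub+zip" ]
}
```

With a real file, `looks_like_epub book.epub && echo yes` would be the check, and `unzip -l book.epub` then lists the HTML, CSS, and image files inside.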
Unfortunately, the renderer that gets used will sometimes itself depend on the file. For example, if you transfer the generated Mobi file to a Kindle as the article says, it will get rendered using an inferior renderer (in terms of kerning, for example) compared to the renderer that would've been used had the file been a KFX ("Kindle Format 10").
The end result would largely depend on the quality of generated HTML and CSS, no? That's what the OP is most likely asking about.
The result is always an epub/mobi so the source doesn’t really matter.
I would like to have better typesetting. It is a pleasure to have a nicely laid-out and typeset page. My layman's take would be that the rendering engines have headroom to improve but most readers don't really care.
Pandoc is a GOAT tool! It's so good.
very interesting