by jamiedamien 1603 days ago
Oh man, I've been down the rabbit hole of reducing matplotlib PDF sizes too many times. Ghostscript is great most of the time, but as mkl points out, it can make some PDFs bigger.

In particular, matplotlib plots that use points (markers) blow up in size quite a bit after processing through ghostscript. This is because matplotlib re-uses the spline information for each marker shape (e.g. circles), whereas ghostscript seemingly cannot / chooses not to (?). I recall something to do with xobject re-use...

I've also found that if you use type 42 fonts (helpful if submitting to a conference where the submission system doesn't accept type 3 fonts), matplotlib will not subset the font, resulting in increased file sizes.
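For reference, type 42 output is normally requested through matplotlib's `pdf.fonttype` rcParam (the default value is 3). A minimal sketch, assuming the setting is applied before any figure is saved:

```python
import matplotlib

# 42 asks the PDF/PS backends to embed TrueType (Type 42) fonts;
# the default of 3 embeds Type 3 fonts instead.
matplotlib.rcParams["pdf.fonttype"] = 42
matplotlib.rcParams["ps.fonttype"] = 42
```

The same keys can also go in a matplotlibrc file if you want this for every plot.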

So I use a similar ghostscript script, but one that also checks if the resulting file is actually smaller. If it's bigger, it just uses the original PDF. For files with lots of points, I've found that rasterizing just the points artist is a good solution (everything else in the plot is still vector). That lets ghostscript subset the type 42 fonts without the file-size explosion due to the points. Still, I wish there were a good tool or script to e.g. just subset fonts in a PDF file, or to process a PDF to remove redundant fonts.
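Rasterizing only the points could look roughly like the sketch below. It relies on the standard `rasterized=True` artist property and the `dpi` argument of `savefig`. The data, marker size, dpi and file name are made up for illustration and are not from my actual plots.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; saving to .pdf still uses the PDF writer
import matplotlib.pyplot as plt

# Illustrative data only: a scatter plot with many markers.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 5000))

fig, ax = plt.subplots()
# Only this artist is rasterized; axes, ticks and text stay vector.
points = ax.scatter(x, y, s=4, rasterized=True)
ax.set_xlabel("x")
ax.set_ylabel("y")

out_path = "scatter_rasterized.pdf"  # hypothetical output name
# dpi sets the resolution of the rasterized artist inside the PDF.
fig.savefig(out_path, dpi=300)
plt.close(fig)
```

The dpi is a trade-off: higher values keep the markers crisp when zoomed in, but the embedded image gets larger.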

When including many PDF plots into a large LaTeX document, each PDF still comes with embedded fonts, which can increase the file size of the final PDF. Most of the fonts end up being duplicates. For this, I use a custom matplotlib backend that creates a PDF file with no text, together with a PGF file that specifies the position of each text. LaTeX then handles all the text rendering (which results in nice looking figures!), so each font is only included once in the final PDF.

1 comments

Wow, this sounds really cool! Out of curiosity, did you get bad results with the pure PGF backend? (And can you link to your script by any chance?) I'm always amazed that including matplotlib plots in LaTeX documents is so fraught, since it's such a common use case.

Also I read that matplotlib 3.5 has some sort of improved support for type 42 subsetting. I haven't had a chance to try it out yet but this could be a welcome improvement!

Oh didn't know about the improved type 42 font support in the new matplotlib! That's good to know and I should check it out.

And good point, the PGF backend works just as well (results should be identical). But since all the plot information has to be compiled by LaTeX, it ends up ballooning the compilation time of the tex doc, and the matplotlib PGF page suggests that you can run into memory issues as well. I was doing this for a thesis with 50+ plots and so still wanted compilation to be fast.

I've suggested this as an improvement to matplotlib, but it's unlikely to be merged since it's maybe a bit hacky (although it's very similar to what Inkscape's export to LaTeX option does): https://github.com/matplotlib/matplotlib/issues/22297 (the backend file can be found here: https://github.com/matplotlib/matplotlib/files/7921801/backe...)

And the gs script is below:

  #!/bin/bash
  # Shrink a PDF with ghostscript; fall back to the original if the result is not smaller.
  set -e
  set -o pipefail
  if [ -z "$1" ]; then
    echo "Usage: $0 input.pdf [output.pdf]"
    exit 1
  fi

  if [ -z "$2" ]; then
    # Default output name: <input>-small.pdf in the current directory.
    outfile="$(basename "$1" .pdf)-small.pdf"
    if [ -f "$outfile" ]; then
      echo "WARNING ${outfile} already exists."
      echo "Usage: $0 input.pdf [output.pdf]"
      exit 1
    fi
  else
    outfile="$2"
  fi

  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dQUIET -dDetectDuplicateImages=true -r150 -sOutputFile="${outfile}" "$1"

  # Read via stdin so wc prints only the byte count (no filename or padding to cut away).
  pre_b=$(wc -c < "$1")
  post_b=$(wc -c < "$outfile")
  if (( pre_b <= post_b )); then
    echo "Original is not larger ($pre_b -> $post_b). copying..."
    cp "$1" "$outfile"
  fi