|
Oh man, I've been down the rabbit hole of reducing matplotlib PDF sizes too many times. Ghostscript is great most of the time, but as mkl points out, it can make some PDFs bigger. In particular, matplotlib plots that use points (markers) blow up in size quite a bit after going through ghostscript. Matplotlib defines the marker shape once (e.g. the splines for a circle) and re-uses it for every point, whereas ghostscript seemingly cannot, or chooses not to, keep that re-use (?). I recall it having something to do with XObject re-use... I've also found that if you use type 42 fonts (helpful if submitting to a conference where the submission system doesn't accept type 3 fonts), matplotlib will not subset the font, resulting in larger files. So I use a similar ghostscript script, but one that also checks whether the resulting file is actually smaller. If it's bigger, it just keeps the original PDF.
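Roughly, the "only keep it if it's smaller" check looks like the sketch below. This is not my exact script, just a minimal version under some assumptions: the ghostscript binary is called `gs` and is on your PATH, and the function names `compress_pdf` and `keep_smaller` are made up for illustration. The flags are plain `pdfwrite` options; tune them to taste.

```python
import os
import shutil
import subprocess
import tempfile


def keep_smaller(original, candidate):
    """Replace `original` with `candidate` only if the candidate is strictly smaller.

    Returns True if the original was replaced, False if it was kept.
    """
    if os.path.getsize(candidate) < os.path.getsize(original):
        shutil.move(candidate, original)
        return True
    os.remove(candidate)  # ghostscript made it bigger (or no smaller): discard
    return False


def compress_pdf(path, gs="gs"):
    """Run a PDF through ghostscript's pdfwrite device, keeping the result only if it shrank.

    `gs` is the assumed name of the ghostscript executable.
    """
    # Write the candidate next to the original so the final move stays on one filesystem.
    fd, tmp = tempfile.mkstemp(suffix=".pdf", dir=os.path.dirname(os.path.abspath(path)))
    os.close(fd)
    cmd = [
        gs,
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        "-dSubsetFonts=true",
        f"-sOutputFile={tmp}",
        path,
    ]
    try:
        subprocess.run(cmd, check=True)
    except Exception:
        os.remove(tmp)  # do not leave a half-written candidate behind
        raise
    return keep_smaller(path, tmp)
```

Calling `compress_pdf("figure.pdf")` rewrites the file in place when ghostscript helps and leaves it untouched otherwise, so it's safe to run over a whole directory of plots.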
For files with lots of points, I've found that rasterizing just the points artist is a good solution (everything else in the plot is still vector). That lets ghostscript subset the type 42 fonts without the file-size explosion caused by the points. Still, I wish there were a good tool or script to just subset the fonts in a PDF file, and to process a PDF to remove redundant fonts. When you include many PDF plots in a large LaTeX document, each PDF still comes with its own embedded fonts, which can increase the file size of the final PDF. Most of those fonts end up being duplicates. For this, I use a custom matplotlib backend that creates a PDF file with no text, together with a PGF file that specifies the position of each piece of text. LaTeX then handles all the text rendering (which results in nice-looking figures!), so each font is only included once in the final PDF. |
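For the rasterizing trick above, here's a minimal sketch of what I mean. The function name `save_scatter`, the random data, and the 300 dpi setting are just illustrative assumptions. The relevant bits are `rasterized=True` on the scatter artist and `pdf.fonttype` set to 42.

```python
import random

import matplotlib

matplotlib.use("Agg")  # headless-safe; saving to PDF works regardless
import matplotlib.pyplot as plt

# Embed fonts as type 42 (TrueType) instead of matplotlib's default type 3.
plt.rcParams["pdf.fonttype"] = 42


def save_scatter(path, n=20000, seed=0):
    """Save a scatter plot where only the markers are rasterized.

    Axes, ticks and text stay vector; the points become one embedded image.
    Returns the scatter artist so the caller can inspect it.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]

    fig, ax = plt.subplots()
    points = ax.scatter(x, y, s=4, rasterized=True)  # rasterize this artist only
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    # dpi controls the resolution of the rasterized layer in the PDF.
    fig.savefig(path, dpi=300)
    plt.close(fig)
    return points
```

With tens of thousands of markers, the image layer is usually much smaller than the equivalent vector drawing commands, and the text in the PDF is still selectable.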
Also, I read that matplotlib 3.5 has some sort of improved support for type 42 subsetting. I haven't had a chance to try it out yet, but this could be a welcome improvement!