Hacker News new | ask | show | jobs
by skety 3045 days ago
I have to handle quite a lot of Excel document at work, much to my despair. Unfortunately we are not using GIT, but SVN. I ended up creating a small tool that extract the vba, named ranges plus the sheet in this format: "cell_adress,type,format,value" I have to commit those files alongside the spreadsheet, which is not great but had helped me a lot. Nice to hear others are working on the problem!
1 comments

Have you tried `textconv` instead? You can configure git to generate your CSVs on the fly and show the differences between CSVs when you do `git diff`. See: https://www.kernel.org/pub/software/scm/git/docs/gitattribut...
Speaking of text conversion and git, I usually don't commit non-textual data aside from necessary files like image and audio assets, but one time I commited some PDFs in a "samples" directory for a tool I made to extract some data from a set of PDF files, and later I removed one of them and observed that when I typed "git show" the diff showed the text contents of the PDF which I find rather mind blowing because of how much trouble I had experienced extracting text and how git was casually showing me an ascii rendering of the document more or less with good representation of the layout of the document.

This in fact prompted me to further investigate the open source text extraction tools on the market and I ended up finding one that was better than the one I had selected at first and which I had then been building upon. Happily my own tools were built in such a way that I could reuse most of the code I had written while using the previous tool, and in fact during the rewrite I also realized that I could write the new code in a much cleaner way and so there were basically only upsides to switching tool and rewriting some of my code :)

What tools did you end up settling on for PDF data/text extraction? I ask because I have a side project that I've been neglecting for far too long which depends in part on cleanly extracting text from PDF (other formats too but PDFs are by far the most headache inducing).
I use the pdf.load and pdf.tree.write of PDFQuery.

https://github.com/jcushman/pdfquery

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.

    #!/usr/bin/env python3

    from pdfquery import PDFQuery

    pdf = PDFQuery("some_document.pdf")
    pdf.load()
    pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")
It doesn’t work perfectly with all documents but it works well with many. Give it a try.

    pip3 install pdfquery
Git really is awesome! Too bad we are only using SVN :-(

I like to have to text version of the workbook so I can query it using grep. We have multiple dozen of very similar workbooks (whoever set that up first was insane). Any change request involves updating all workbooks the exact same way. After I update them I can run something like this: grep "^A1" /*/sheet1.csv

Using this I get the value/formula and format of all workbooks for sheet1!A1. That way I am sure they have all been updated the same way. And this works with the VBA code as well.

Not the nicest system I have worked on, at least we don't have to touch it too often.

Excellent