by Macha
161 days ago
> The key insight is that bank statement PDFs are almost always columnar. Of course, this relies on the PDF having a proper text layer; if your bank sends you scanned images, you’re out of luck (though I’ve yet to encounter one that does). When you convert them to text while preserving the layout, you get something that looks like this:

So I decided to try this out with my bank, whose export options are XLSX (in one of the slightly silly multi-line formats mentioned) or PDF only, and it appears they've done some "encryption" to the PDF to prevent this (really a simple substitution cipher, plus an embedded font with the characters jumbled up so it still renders correctly). All the marketing text and headers come through fine in the pdftotext output, but the actual data is all accented and non-printable characters (the same happens if you copy/paste it out). The substitution cipher does seem stable across a few statements, but it still seems like less work to work off the XLSX.
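If the substitution really is stable, undoing it could look roughly like the sketch below. It assumes the cipher is a one-to-one character mapping in the extracted text, and that a few garbled strings can be paired by hand with the values shown in the rendered PDF. The function names and the sample pairs are made up for illustration; they are not taken from any real statement.

```python
def build_mapping(pairs):
    """Derive a garbled -> plain character map from (garbled, plain) string pairs.

    Each pair is a string as extracted from the PDF and the same string as
    read off the rendered page. Assumes one garbled character per plain one.
    """
    mapping = {}
    for garbled, plain in pairs:
        if len(garbled) != len(plain):
            raise ValueError(f"length mismatch: {garbled!r} vs {plain!r}")
        for g, p in zip(garbled, plain):
            # setdefault stores p the first time g is seen and returns the
            # stored value afterwards, so a conflict shows up as a mismatch.
            if mapping.setdefault(g, p) != p:
                raise ValueError(f"{g!r} maps to both {mapping[g]!r} and {p!r}")
    return mapping


def decode(text, mapping):
    """Replace every mapped character; unmapped ones (headers etc.) pass through."""
    return text.translate(str.maketrans(mapping))


# Hypothetical known pairs, invented for this example.
KNOWN_PAIRS = [
    ("\u00e9\u00e0\u00fc\u00f1", "1234"),
    ("\u00f8\u00e0", "52"),
]

if __name__ == "__main__":
    table = build_mapping(KNOWN_PAIRS)
    print(decode("Balance: \u00f1\u00f8.\u00e0\u00e9", table))
```

Characters that never appear in a known pair are left as they are, so any remaining accented or non-printable characters in the output point at gaps in the mapping. Whether this is less work than parsing the XLSX depends on how many distinct characters the statements use.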
|
I guess nowadays it's very cheap to run a headless browser, screenshot the output, and run it through OCR. Hah, to prevent that they'd have to design their webpage as one full-screen CAPTCHA.