by mc42 3404 days ago
A cursory Google search suggests you could use a package like poppler to convert the PDF to raw text and then, in theory, use regular expressions to extract data your server could store and serve.
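
For what it's worth, here is a rough sketch of that idea in Python. It shells out to poppler's `pdftotext` command-line tool and then pulls times out of each line with a regex. The row shape (a stop name followed by HH:MM times) and the function names `pdf_to_text` and `parse_timetable` are my own assumptions for illustration; a real timetable will almost certainly need a different pattern.

```python
import re
import subprocess

# Assumption: each useful row looks like "Stop name   7:05  07:35  8:05".
TIME_RE = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")


def pdf_to_text(path):
    """Run poppler's pdftotext on a file and return the extracted text.

    '-layout' tries to preserve the column layout; '-' writes to stdout.
    Requires poppler-utils to be installed.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_timetable(text):
    """Turn raw text into a list of {"stop": ..., "times": [...]} rows.

    Lines without any time, or without text before the first time,
    are skipped (page numbers, headers, and so on).
    """
    rows = []
    for line in text.splitlines():
        first = TIME_RE.search(line)
        if first is None:
            continue
        stop = line[:first.start()].strip()
        times = [m.group(0) for m in TIME_RE.finditer(line)]
        if stop:
            rows.append({"stop": stop, "times": times})
    return rows
```

You would call it as `parse_timetable(pdf_to_text("timetable.pdf"))`, where the filename is just a placeholder. This only works when the PDF contains real text, not a scanned image.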

If the PDFs are published as scans, as so many municipalities do, then OCR is the only way to go.

Either way, good luck, and the design is decently nice.


I really appreciate you taking a look as well as providing your feedback.

Regarding the design, I just wanted to get something out the door quickly with a suitable look out of the box, so I decided to use MaterializeCSS (http://materializecss.com/). It's getting the job done so far, but I may revisit the design after I get all the content up.

And I'll look into poppler. Thank you for the recommendation.

If the timetables aren't particularly easy to read or parse, OCR output is likely to contain errors, so you'd have to check it by hand anyway. You might as well enter the data manually while there's no clean technical way of doing it (maybe contact the companies if you get big numbers and ask about an arrangement?). You could set up a script, on a VPS if you're doing it that way, that checks the PDF daily and notifies you if the file changes. That'd be fairly trivial to set up.
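
A minimal sketch of such a change checker in Python, standard library only. It hashes the downloaded PDF and compares the result with the hash saved on the previous run. The URL, the state file path, and the "notification" (here just a `print`) are placeholders I made up; swap in email or whatever you prefer, and schedule it with cron or similar to run once a day.

```python
import hashlib
import sys
import urllib.request
from pathlib import Path


def fetch(url):
    """Download the file at url and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def has_changed(data, state_file):
    """Return True if data differs from what was seen on the previous run.

    The SHA-256 digest of the last download is kept in state_file.
    The first run only records a baseline and returns False.
    """
    digest = hashlib.sha256(data).hexdigest()
    state = Path(state_file)
    previous = state.read_text().strip() if state.exists() else None
    state.write_text(digest)
    return previous is not None and previous != digest


if __name__ == "__main__":
    # Usage (hypothetical): python check_pdf.py <pdf-url> <state-file>
    url, state_file = sys.argv[1], sys.argv[2]
    if has_changed(fetch(url), state_file):
        # Placeholder notification; replace with email, SMS, etc.
        print(f"Timetable PDF changed: {url}")
```

Comparing hashes catches any byte-level change, including ones that don't affect the timetable itself (a regenerated PDF with new metadata, for example), so expect the occasional false alarm.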