
	by kerkeslager 2737 days ago
	As long as those spreadsheets/database files are accessible to someone with technical skill, people can pull in the data and use tools to make it more accessible and useful. Ideally, yes, the data is useful to begin with, but as long as it's available, there's nothing stopping individuals with the skills from making it useful. Of course, there are exceptions: the PDFs that are often provided by the prosecution as part of the discovery process are prohibitively difficult to deal with, and should be considered a violation of Brady v. Maryland, IMO.

2 comments

_bohm 2737 days ago

As part of my job, I've spent a great deal of time parsing data out of government PDFs that isn't available by any other means. In the process I've learned how difficult this information can be to access, even for people who don't need it in a machine-readable format. It certainly has been an interesting exercise in how far simple web scraping tools can be pushed, though.


ams6110 2736 days ago

Amazon Textract was recently announced; it sounds like it might be good for that. I haven't tried it myself.

https://aws.amazon.com/about-aws/whats-new/2018/11/introduci...


_bohm 2736 days ago

I applied to the beta, but they never got back to me :\


bpchaps 2736 days ago

Have you tried Apache Tika? It's pretty decent.


_bohm 2736 days ago

Nope, I'll have to give it a spin. Thanks for the recommendation!


TheAceOfHearts 2736 days ago

Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.


_bohm 2736 days ago

Sure! In terms of raw text extraction (for documents that don't require OCR), the most useful tools I've worked with have been pdftotext [0] and PyMuPDF [1]. For extracting useful details, really, my best advice is to make sure that your regex skills are sharp. I've been meaning to explore the possibility of using NLP tools for named entity recognition, but unfortunately I don't have much of a background there.
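
To illustrate the extract-then-regex workflow described above, here is a minimal sketch. It assumes the `pdftotext` command-line tool is on the PATH and that PyMuPDF is installed. The function names, the `CASE_NO` and `ISO_DATE` patterns, and the case-number layout are all hypothetical; real documents will need their own patterns.

```python
import re
import subprocess

# Hypothetical field patterns, for illustration only.
CASE_NO = re.compile(r"Case\s+No\.?\s*(\d{2}-[A-Z]{2}-\d+)")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def text_via_pdftotext(pdf_path: str) -> str:
    """Extract text with the pdftotext CLI.

    '-layout' preserves the physical layout; '-' sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def text_via_pymupdf(pdf_path: str) -> str:
    """Extract text page by page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the regex helper works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def find_fields(text: str) -> dict:
    """Pull the hypothetical case numbers and ISO dates out of raw text."""
    return {
        "case_numbers": CASE_NO.findall(text),
        "dates": ISO_DATE.findall(text),
    }
```

Either extractor can feed the regex step, for example `find_fields(text_via_pdftotext("report.pdf"))`, where `report.pdf` is a placeholder file name. Neither extractor performs OCR, so scanned documents would need a separate step first.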

The rest of it kind of just comes down to using good software engineering practices to help keep yourself sane. Find useful abstractions for common tasks you need to perform and build a library around them, make sure that your data processing pipeline is designed with enough flexibility to handle inputs in different formats so that adding or modifying parsing logic becomes trivial, etc.
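
One way to get that kind of flexibility is a small parser registry, sketched below under stated assumptions. The two document layouts ("INSPECTION REPORT", "NOTICE OF VIOLATION"), their field names, and every function name are hypothetical; this is just one possible shape for such a pipeline, not a description of any particular codebase.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Detector = Callable[[str], bool]
Parser = Callable[[str], Record]

# Ordered list of (detector, parser) pairs; first matching detector wins.
_PARSERS: List[Tuple[Detector, Parser]] = []


def register(detector: Detector) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for texts the detector accepts."""
    def decorator(parser: Parser) -> Parser:
        _PARSERS.append((detector, parser))
        return parser
    return decorator


@register(lambda text: text.lstrip().startswith("INSPECTION REPORT"))
def parse_inspection(text: str) -> Record:
    # Hypothetical layout: a "Facility: <name>" line somewhere in the text.
    match = re.search(r"Facility:\s*(.+)", text)
    return {"kind": "inspection", "facility": match.group(1).strip() if match else ""}


@register(lambda text: "NOTICE OF VIOLATION" in text)
def parse_violation(text: str) -> Record:
    # Hypothetical layout: a "Violation Code: <code>" line somewhere in the text.
    match = re.search(r"Violation\s+Code:\s*(\S+)", text)
    return {"kind": "violation", "code": match.group(1) if match else ""}


def parse_document(text: str) -> Record:
    """Dispatch extracted text to the first parser whose detector matches."""
    for detector, parser in _PARSERS:
        if detector(text):
            return parser(text)
    raise ValueError("no parser registered for this document layout")
```

Supporting a new layout then means adding one decorated function, and the dispatch code stays untouched.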

[0] https://www.xpdfreader.com/pdftotext-man.html
[1] https://pymupdf.readthedocs.io/en/latest/


ocrcustomserver 2736 days ago

pdfminer is another good library (Python).


ethbro 2737 days ago

Exactly. Being accessible and machine-readable is necessary but not sufficient. Thankfully, civil society can reasonably pick up the slack.

With regard to modern-day transparency requirements, it seems like laws should include a reasonableness clause.

Compare making records available to the public but requiring them to be photocopied by hand with making them available in electronic form in a custom format.

Both are open, but they involve two very different magnitudes of effort.
