by bluehorseray 1134 days ago
pdftotext in Python works pretty well:

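  import pdftotext

  # Load the PDF; physical=True asks for text in the order it appears on the page.
  with open("test.pdf", "rb") as f:
      pdf = pdftotext.PDF(f, physical=True)

  # Each page comes back as a plain string.
  for page in pdf:
      print(page)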

As an extra bit of help, the pdftotext command-line tool has a '-layout' option that gives you a formatted text document laid out the same way as the original .pdf document:

       pdftotext -layout my_file.pdf

will produce 'my_file.txt'.
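
If you'd rather drive that command from a script, here is a minimal sketch using subprocess. It assumes the pdftotext binary is on your PATH; the helper names (layout_command, expected_output, convert) are made up for illustration and are not part of any library.

```python
import subprocess
import sys
from pathlib import Path


def layout_command(pdf_path):
    """Build the argument list for `pdftotext -layout <pdf>`."""
    return ["pdftotext", "-layout", str(pdf_path)]


def expected_output(pdf_path):
    """With no output name given, the text file lands next to the input as .txt."""
    return Path(pdf_path).with_suffix(".txt")


def convert(pdf_path):
    """Run pdftotext (assumed to be installed) and return the .txt path."""
    subprocess.run(layout_command(pdf_path), check=True)
    return expected_output(pdf_path)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(convert(sys.argv[1]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.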

Note also that the .pdf must contain TEXT as such. An IMAGE of text will not work as expected with 'pdftotext'.
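
A rough way to spot that situation is to check which pages come back empty. This is only a heuristic sketch: pages_without_text is a hypothetical helper, and a genuinely blank page gets flagged the same way as a scanned image.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [n for n, text in enumerate(pages, start=1) if not text.strip()]
```

It takes any iterable of page strings, so with the Python pdftotext module you could pass it the pdftotext.PDF object directly. If every page is listed, the file is probably images only.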