

by throwaw33333434 947 days ago

Does anyone have a way to improve PDF data extraction? I want to convert a table in a PDF to a CSV.

So far, the best performance has come from converting the page to a string:

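import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()  # assumption: reads OPENAI_API_KEY from the environment

# Extract the plain text of one page (load_page is 0-indexed)
pdf_document = fitz.open("foo.pdf")
page_number = 1
page = pdf_document.load_page(page_number - 1)
text = page.get_text("text")

# Send the page text to the model inside the (elided) prompt
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": f""" ..... {text} .... """,
        }
    ],
)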

If I try regular ChatGPT, it takes 3 minutes to convert the table (I have to press continue). Is there a way to force the API to create the whole CSV? Some sort of retry?
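For reference, the "press continue" step can be approximated over the API. The sketch below is one possible approach, not a confirmed fix: it assumes the output is being cut off by the token limit, which the chat completions API signals with `finish_reason == "length"`. The helper name `complete_until_done`, the `max_rounds` cap, and the wording of the follow-up prompt are all hypothetical.

```python
def complete_until_done(client, messages, model="gpt-3.5-turbo", max_rounds=5):
    """Call the chat API repeatedly until the model stops on its own.

    `client` is expected to look like an openai.OpenAI() instance.
    Returns the concatenated text of all rounds.
    """
    messages = list(messages)  # copy so the caller's list is not mutated
    parts = []
    for _ in range(max_rounds):
        response = client.chat.completions.create(model=model, messages=messages)
        choice = response.choices[0]
        text = choice.message.content or ""
        parts.append(text)
        # "length" means the reply hit the token limit; anything else means done
        if choice.finish_reason != "length":
            break
        # Feed the partial answer back and ask for the rest (hypothetical prompt)
        messages.append({"role": "assistant", "content": text})
        messages.append({
            "role": "user",
            "content": "Continue the CSV exactly where you stopped. "
                       "Do not repeat earlier rows.",
        })
    return "".join(parts)
```

The joined result may still need a check at the seams, since the model can restart a row instead of resuming it mid-line.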

1 comments

simonw 947 days ago

I've had really good results from AWS Textract for that.

It's a bit of a pain to get started with, but if you have an AWS account you can find a UI for using it buried deep within the AWS web console.
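Outside the console, Textract can also be called from code. The following is a rough sketch, assuming boto3 with configured AWS credentials and a single-page document sent to the synchronous `analyze_document` call with `FeatureTypes=["TABLES"]`. The helper names `fetch_blocks` and `textract_tables_to_csv` are made up for illustration; the second one only walks the TABLE, CELL, and WORD blocks of the response and ignores merged cells and table titles.

```python
import csv
import io


def fetch_blocks(path):
    """Send one single-page document to Textract and return its Blocks list."""
    import boto3  # imported lazily so the parser below works without AWS

    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
    return response["Blocks"]


def textract_tables_to_csv(blocks):
    """Return one CSV string per TABLE block found in `blocks`."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships, if any
        return [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    csv_tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cid in child_ids(table):
            cell = by_id[cid]
            if cell["BlockType"] != "CELL":
                continue  # skip merged cells, titles, and similar blocks
            words = [
                by_id[w]["Text"]
                for w in child_ids(cell)
                if by_id[w]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        csv_tables.append(out.getvalue())
    return csv_tables
```

With credentials in place, `textract_tables_to_csv(fetch_blocks("page.pdf"))` would return the tables found on that page, where `page.pdf` is a placeholder path.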


throwaw33333434 947 days ago

Is it any good for PDFs that are NOT images?
