> All frontier multi modal LLMs can do this

There's reliable, and there's reliable. For example, [1] is a conversation where I ask ChatGPT 4o questions about a seven-page tabular PDF from [2], which contains a list of election polling stations.

The results are simultaneously impressive and unimpressive. The document contains some repeated addresses, and the LLM correctly identifies all 11 of them... then says it found ten.

It deals gracefully with the PDF table and converts the all-caps input data into Title Case. The table is split across multiple pages, and the title row repeats on each one; it handles that easily. It correctly finds all five schools mentioned. When asked to extract an address that isn't in the document, it correctly refuses instead of hallucinating an answer.

When asked to count churches, it misses "Bunyan Baptist Church". Of two church halls, only one gets counted. The "Friends Meeting House" also doesn't get counted, but arguably that's not a church even if it is a place of worship.

Longmeadow Evangelical Church has one address, three rows and two polling station numbers. When asked how many polling stations are in the table, the LLM counts it as two. A reasonable person might have expected one, two, three, or a warning. If I were writing an invoice parser, I would want this to be very predictable.

So, it's a mixed bag. I've certainly seen worse attempts at parsing a PDF.

[1] https://chatgpt.com/share/67812ad9-f2bc-8011-96be-faea40e48d...
[2] https://www.stevenage.gov.uk/documents/elections/2024-pcc-el...
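To make the Longmeadow Evangelical Church ambiguity concrete, here is a small sketch. The rows are hypothetical placeholders modelled only on "one address, three rows and two polling station numbers": the address `ADDRESS A`, the station numbers `S1`/`S2`, and which rows share a number are all invented, not taken from the PDF. Each counting rule gives a different but defensible answer, which is why a deterministic parser has to pick a rule explicitly, or warn when the rules disagree.

```python
from collections import namedtuple

# Hypothetical rows: placeholder address and station numbers, modelled only on
# the description "one address, three rows and two polling station numbers".
Row = namedtuple("Row", ["station_no", "name", "address"])

rows = [
    Row("S1", "Longmeadow Evangelical Church", "ADDRESS A"),
    Row("S1", "Longmeadow Evangelical Church", "ADDRESS A"),
    Row("S2", "Longmeadow Evangelical Church", "ADDRESS A"),
]


def count_distinct(rows, key):
    """Count distinct values of key(row); each key is one definition of 'polling station'."""
    return len({key(r) for r in rows})


counts = {
    "rows": len(rows),
    "station numbers": count_distinct(rows, lambda r: r.station_no),
    "addresses": count_distinct(rows, lambda r: r.address),
}

for definition, n in counts.items():
    print(f"{definition}: {n}")

# A predictable parser can at least flag that the answer depends on the definition.
if len(set(counts.values())) > 1:
    print("warning: 'polling station' count depends on the definition used")
```

Running it prints 3, 2 and 1 for the three definitions, followed by the warning. The point is not which number is right, but that the choice is written down and repeatable.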