Hacker News
by eth0up 641 days ago
I used GPT4o to convert heavily convoluted PDFs into CSV files. The files were Florida Lottery Pick(n) histories, which they deliberately convolute to prevent automatic searching; Ctrl-F does nothing and a fsck-ton of special characters embellish the whole file.

I had previously done so manually, with regex, and was surprised by the quality of GPT's end results, despite many preceding failed iterations. The work was done in two steps, first with pdf2text, then python.
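A minimal sketch of a pre-cleaning pass that could sit between the two steps. It assumes the "special characters" are non-ASCII or control-character debris left in the extracted text; the function name `strip_debris` is made up for illustration, and the real files may need different rules.

```python
import re

def strip_debris(text: str) -> str:
    """Keep printable ASCII and newlines; collapse runs of spaces."""
    # Replace anything that is not printable ASCII or a newline with a space.
    cleaned = re.sub(r'[^\x20-\x7E\n]', ' ', text)
    # Collapse repeated spaces so a later regex sees uniform spacing.
    cleaned = re.sub(r' {2,}', ' ', cleaned)
    return cleaned

# Hypothetical sample line with a non-breaking space, a bullet and a form feed.
sample = "01/02/24\u00a0\u2022 E \x0c 1 - 2 - 3 - 4"
print(strip_debris(sample))
```

Running the regex on the cleaned text rather than the raw dump keeps the pattern itself simple.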

I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which their web-hosted search function limits to only two of 30+ years.

I know there's a more efficient method, but I don't know more than that.
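One way the append step could look, as a minimal sketch. It assumes the CSV has Date, Draw, Numbers and Fireball columns (as in the script posted further down the thread) and that a (Date, Draw) pair uniquely identifies a drawing. Fetching the latest numbers from the website is left out because the site's format isn't described here; `append_new_rows` is a hypothetical helper name.

```python
import csv
import os

FIELDNAMES = ['Date', 'Draw', 'Numbers', 'Fireball']

def append_new_rows(csv_path, new_rows):
    """Append rows whose (Date, Draw) key is not already in the CSV.

    Returns the number of rows actually written.
    """
    has_data = os.path.exists(csv_path) and os.path.getsize(csv_path) > 0
    seen = set()
    if has_data:
        # Read only the keys of the existing history, not the source PDF.
        with open(csv_path, newline='') as f:
            for row in csv.DictReader(f):
                seen.add((row['Date'], row['Draw']))

    added = 0
    with open(csv_path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if not has_data:
            writer.writeheader()
        for row in new_rows:
            key = (row['Date'], row['Draw'])
            if key not in seen:
                writer.writerow(row)
                seen.add(key)
                added += 1
    return added
```

New rows land at the end of the file, so a history that is kept sorted by date would need a re-sort afterwards.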

5 comments

I really appreciate you sharing your hands-on experience with a real-world scenario. It's interesting how people unfamiliar with traditional OCR often doubt LLMs, but having worked with actual documents, I know how inefficient classic OCR methods can be. So these minor errors don't alarm me at all. Your use case sounds fascinating - I might just incorporate it into my own benchmarks. Thanks again for your insightful comment!
This sounds like a fun and interesting challenge! I am tempted to try it on my own.

I’m surprised an LLM actually works for that purpose. It has been my experience with gpt reading pdfs that it’ll get the first few entries from a pdf correct then just start making up numbers.

I’ve tried a few times having gpt4 analyze a credit card statement and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an llm at all on an obfuscated pdf, at least not without thorough double checking.

>then just start making up numbers...

Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard python code and the results are tested multiple times. As an assistant, gpt can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.

Edit: below is an example of what it generated after a lot of debugging and hassle:

import re
import csv
from datetime import datetime

def clean_and_structure_data(text):
    """Cleans and structures the extracted text data."""
    # Regular expression pattern to match the lottery data
    pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
    matches = re.findall(pattern, text)

    structured_data = []
    for match in matches:
        date, draw_type, n1, n2, n3, n4, fireball = match
        # Format the date to include the full year
        date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y')
        # Concatenate the numbers, ensuring leading zeros are preserved, and enclose in quotes
        numbers = f'"{n1}{n2}{n3}{n4}"'
        structured_data.append({
            'Date': date,
            'Draw': draw_type,
            'Numbers': numbers,
            'Fireball': fireball or ''  # Use empty string if Fireball is None
        })
    return structured_data

def save_to_csv(data, output_path):
    """Saves the structured data to a CSV file."""
    # Sort data by date in descending order
    sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)

    with open(output_path, 'w', newline='') as csvfile:
        fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in sorted_data:
            writer.writerow(row)

def main():
    # Path to the text file
    txt_path = 'PICK4.txt'  # Ensure this path points to your actual text file
    output_csv_path = 'output.csv'  # Ensure this path is where you want the CSV file saved

    try:
        with open(txt_path, 'r') as file:
            text = file.read()

        cleaned_data = clean_and_structure_data(text)
        save_to_csv(cleaned_data, output_csv_path)
        print(f"Data successfully extracted and saved to {output_csv_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

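As for the "tested multiple times" part, here is a minimal sketch of one such check, not necessarily what was actually run. It assumes each drawing sits on its own line of the extracted text, and it flags any line where a date appears but the full row pattern does not match, which is where silently dropped rows would hide. `parse_report` is a hypothetical name.

```python
import re

# Same row shape as the script above, compiled once.
ROW_RE = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d)\s-\s(\d)\s-\s(\d)\s-\s(\d)(?:\s+FB\s+(\d))?'
)
DATE_RE = re.compile(r'\b\d{2}/\d{2}/\d{2}\b')

def parse_report(text):
    """Return (parsed_row_count, lines where dates and matched rows disagree)."""
    parsed = 0
    suspicious = []
    for line in text.splitlines():
        n_dates = len(DATE_RE.findall(line))
        n_rows = len(ROW_RE.findall(line))
        parsed += n_rows
        if n_dates != n_rows:
            suspicious.append(line)
    return parsed, suspicious

# Hypothetical sample: the second line is missing its fourth digit.
sample = (
    "01/02/24 E 1 - 2 - 3 - 4 FB 5\n"
    "01/02/24 M 1 - 2 - 3\n"
    "03/04/24 M 0 - 0 - 7 - 9"
)
count, bad_lines = parse_report(sample)
print(count, bad_lines)
```

An empty list of suspicious lines does not prove the CSV is right, but a non-empty one points straight at the rows to inspect by hand.
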
I had the same problem with a PDF schematic for a BTT Octopus 3D printer board (which is published in their GitHub repo).

Unsearchable, weird characters behind the curtain, etc.

But I don't blame deliberate obfuscation (or any other deliberate attempt to hide information) at all.

Instead, I simply blame incompetence.

(There's a ton of shitty PDFs in the world; this is just an example that I've encountered recently.)

Off topic - but the obvious follow-up question is why do you want people to be able to search the entire history?
Thanks for asking...

1) I'm a rebel

2) I am irritated by deliberate obfuscation of public data, especially by a source that I suspect is corrupt, although my extensive analysis has not yet revealed any significant anomalies in their numbers.

3) It's kind of my re-intro into python, which I never made significant progress in but always wanted to.

4) It's literally the real history of all winning numbers since inception. Individuals may have various reasons for accessing this data, but I've been using it to test for manipulation. I presume for most folks it would be curiosity, or gambler's fallacy type stuff. Regardless, it shouldn't be obfuscated.

I had suspected you were suspicious of manipulation. I have heard many rumors of lottery corruption and manipulation.

It’s certainly a big red flag if they are deliberately obstructing access to the data.

Your project makes sense, and I’d probably take 30 minutes to look at the data if I came across it. I’m somewhat decent at data and number analysis, so if there is something there and enough people can easily take a look, it might get exposed.

Interesting and good luck.

There are private APIs that have that data (current and historical).

Do you think the official data published is 100% correct if they were trying to hide something?

I am honestly not certain why they obstruct easy access to the number history. It's obviously accessible, but only through manually parsing the PDF. Their prior embedded search function, approximately two years ago, would return all permutations of the queried number from day 1 to present. They modified it to exclude results more than two years old. The PDF contains the entire data set, but isn't searchable. Why? Dunno. But I'm cynical.
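A permutation search of that kind is simple to rebuild on top of a CSV. A minimal sketch, assuming rows shaped like the output of the script earlier in the thread (a Numbers field holding the digits, possibly wrapped in literal quotes); `search_permutations` is a hypothetical name, not the lottery site's function.

```python
def search_permutations(rows, query):
    """Return every row whose digits are a rearrangement of the query."""
    target = sorted(query)
    # Two numbers are permutations of each other when their sorted digits match.
    return [r for r in rows if sorted(r['Numbers'].strip('"')) == target]

# Hypothetical rows for illustration only.
history = [
    {'Date': '01/02/2024', 'Draw': 'E', 'Numbers': '0123'},
    {'Date': '01/03/2024', 'Draw': 'M', 'Numbers': '"3210"'},
    {'Date': '01/04/2024', 'Draw': 'E', 'Numbers': '0124'},
]
hits = search_permutations(history, '1230')
print([h['Date'] for h in hits])
```

Comparing sorted digits also handles repeated digits correctly, since 1123 and 1223 sort differently.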

I've also compiled a list of all numbers that have never occurred, a count of each occurrence, and a lot more. My anomaly analysis has included everything that I, as an ignoramus, can throw at it: chi-squared tests, isolation forests, time series, and a lot of stuff I don't properly understand. Most anomalies found have been, if narrowly, within expected randomness, but I intend to fortify my proddings eventually. Although I'm actually confident I'm barking up the wrong tree, the data obfuscation is objectively dubious, whatever the reason.
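For anyone wanting to repeat the simpler checks, here is a minimal sketch, not the analysis described above. It assumes draws are available as digit strings such as '0123', pools digits across all positions, and tests them against a uniform expectation. The function names are made up; 16.919 is the standard chi-square critical value for 9 degrees of freedom at the 5% level.

```python
from collections import Counter

CHI2_CRIT_DF9_P05 = 16.919  # 9 degrees of freedom, alpha = 0.05

def never_drawn(draws, width=4):
    """List every width-digit number that is absent from the history."""
    counts = Counter(draws)
    every = (f'{i:0{width}d}' for i in range(10 ** width))
    return [n for n in every if n not in counts]

def digit_chi_square(draws):
    """Chi-square goodness-of-fit statistic for digits 0-9 vs. a uniform spread."""
    observed = Counter(d for draw in draws for d in draw)
    expected = sum(observed.values()) / 10
    return sum(
        (observed.get(str(d), 0) - expected) ** 2 / expected
        for d in range(10)
    )

# Hypothetical toy history in which every digit appears exactly twice.
toy = ['0123', '4567', '8901', '2345', '6789']
stat = digit_chi_square(toy)
print(stat, stat > CHI2_CRIT_DF9_P05)
print(len(never_drawn(toy)))
```

A statistic above the critical value would only say the pooled digits look non-uniform at that level. Running many such tests on the same data will produce some "anomalies" by chance alone, which fits the "narrowly within expected randomness" results.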

I've worked in the field and it could just be that the developers in charge of the new site didn't know/care how to get the data from the old system.
Old Hanlon's razor. Maybe, but I'd rather assume malice when it comes to the Florida Lottery.