by mmh0000 641 days ago
This sounds like a fun and interesting challenge! I am tempted to try it on my own.

I’m surprised an LLM actually works for that purpose. In my experience with GPT reading PDFs, it’ll get the first few entries from a PDF correct, then just start making up numbers.

I’ve tried a few times having GPT-4 analyze a credit card statement, and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an LLM at all on an obfuscated PDF, at least not without thorough double-checking.

1 comments

>then just start making up numbers...

Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard Python code and the results are tested multiple times. As an assistant, GPT can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.

Edit: below is an example of what it generated after a lot of debugging and hassle:

import re
import csv
from datetime import datetime

def clean_and_structure_data(text):
    """Cleans and structures the extracted text data."""
    # Regular expression pattern to match the lottery data
    pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
    matches = re.findall(pattern, text)

    structured_data = []
    for match in matches:
        date, draw_type, n1, n2, n3, n4, fireball = match
        # Format the date to include the full year
        date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y')
        # Concatenate the numbers, ensuring leading zeros are preserved, and enclose in quotes
        numbers = f'"{n1}{n2}{n3}{n4}"'
        structured_data.append({
            'Date': date,
            'Draw': draw_type,
            'Numbers': numbers,
            'Fireball': fireball or ''  # re.findall yields '' when the optional FB group is absent
        })
    return structured_data

def save_to_csv(data, output_path):
    """Saves the structured data to a CSV file."""
    # Sort data by date in descending order
    sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)

    with open(output_path, 'w', newline='') as csvfile:
        fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in sorted_data:
            writer.writerow(row)

def main():
    # Path to the text file
    txt_path = 'PICK4.txt'  # Ensure this path points to your actual text file
    output_csv_path = 'output.csv'  # Ensure this path is where you want the CSV file saved

    try:
        with open(txt_path, 'r') as file:
            text = file.read()
       
        cleaned_data = clean_and_structure_data(text)
        save_to_csv(cleaned_data, output_csv_path)
        print(f"Data successfully extracted and saved to {output_csv_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
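
For the "tested multiple times" part, one cheap check is to look for rows the regex silently skipped. This is only a rough sketch, not what was actually run: it assumes one draw per line, the `find_unparsed_lines` helper and the sample rows are made up for illustration, and the real PICK4.txt layout may differ.

```python
import re

# Same pattern as in the script above.
PATTERN = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
)
# Assumption: any line starting with something date-like should be a draw row.
DATE_LIKE = re.compile(r'^\s*\d{2}/\d{2}/\d{2}\b')


def find_unparsed_lines(text):
    """Return the date-bearing lines that the main pattern fails to match."""
    return [
        line for line in text.splitlines()
        if DATE_LIKE.search(line) and not PATTERN.search(line)
    ]


# Hypothetical sample rows, inferred from the regex rather than the real file.
sample = "\n".join([
    "01/05/24 E 0 - 1 - 2 - 3 FB 7",
    "01/05/24 M 4 - 5 - 6 - 7",
    "01/04/24 E 8 - 9 - O - 1",  # letter O instead of a zero, so it gets flagged
])

missed = find_unparsed_lines(sample)
print(f"{len(missed)} line(s) not parsed")
for line in missed:
    print("  ", line)
```

If this reports anything, the extracted text probably has an OCR-style glitch or a row layout the pattern does not cover, and the CSV is missing draws.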