Hacker News
by azemetre 860 days ago
Interesting!

I’ve held on to my grocery receipts for the last 8 years. I took pictures of them, but when I moved a box got water damaged, so now I only have roughly the last 3 years.

Is there any open source software that I can use to transfer these receipts into a useful csv?

I have an idea for a few interesting data visualizations, as I’d often buy the same things every week. My grocery bill went from about $70 to $150 without much change in what I buy, as far as I can tell.

Would be cool to put it out in the public.

5 comments

https://docs.paperless-ngx.com

Nextcloud also has OCR. You can use a scanner with either.

Avoid touching the receipts. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5453537/

I didn't see anything specifically receipt-oriented, but found a short blurb here that mentions receipts almost tangentially: https://docs.paperless-ngx.com/usage/#basic-searching
I run paperless on an Unraid server at home and it works really well. It has "machine learning" (based on a model that you host yourself that grows as you use it). It has good search, impressive OCR, and generally works really well.

My only complaint is that it lacks a good organization workflow. I have a shared network folder, and any file (image, pdf, etc) that you put in that folder gets consumed into Paperless almost immediately. I have a printer/scanner that allows me to scan to an SMB network drive, so I configured any scans from it to go to that shared folder, which makes integration really nice. I also use Genius Scan on my phone to scan to the same network drive (which requires pro, which is ~$16 a year I think). Genius Scan can save locally to your phone and upload later when you get home, which makes for a good workflow.

The problem is that once a file gets into Paperless, there's not a good workflow for reviewing and labeling it. I have been meaning to sit down and contribute to the open source project to improve this, but haven't found the time yet. This is the biggest weakness of the project imo.

For those who have never used paperless, the naming may confuse you. It started off as an open source project named paperless. Then it got abandoned and a team picked it up to update it and make it more modern, and they renamed it paperless-ng (for angular I assume, the new frontend). Then that project lost momentum, so it was forked again and is now paperless-ngx, which is the current iteration. It currently has a very strong community and gets good updates.

hmm, I guess I have a weekend project now.

And thanks for the heads up about the toxicity. I used to save them all, but after the move I simply take a picture with my phone and throw them out.

paperless doesn't seem to be my exact use case but hopefully after it does the OCR transformation it can allow you to make a csv file.
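To make the "OCR text to csv" step concrete, here is a minimal Python sketch. It assumes the OCR output is plain text with one item per line and a price at the end of the line, which real receipts will often violate. The function names, the column names, and the list of skipped summary words are made up for illustration and are not part of paperless.

```python
import csv
import io
import re

# Assumed line shape: item text, whitespace, then a price such as 3.99 or $3.99.
LINE_RE = re.compile(r"^(?P<item>.+?)\s+\$?(?P<price>\d+\.\d{2})\s*$")

# Summary lines we do not want as items (an illustrative list, not exhaustive).
SKIP = {"TOTAL", "SUBTOTAL", "TAX"}


def ocr_text_to_rows(text, receipt_date):
    """Turn raw OCR text into a list of {date, item, price} dicts."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # headers, addresses, blank lines, etc.
        item = match.group("item").strip()
        if item.upper() in SKIP:
            continue
        rows.append(
            {"date": receipt_date, "item": item, "price": float(match.group("price"))}
        )
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a fixed header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["date", "item", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Lines that do not match are silently dropped here; in practice you would probably want to log them so you can see what the pattern misses.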

I'd look into a document scanner for ease of use. There are ones that auto-feed many documents, so no more waiting around. With that said, if you purchase a scanner, it probably already comes with proprietary OCR. I foolishly bought one not knowing auto feeding was an option. https://youtu.be/fi0ZhTFaW7w I bought a Brother 2-sided one since it had Linux drivers.
hmm, IDK if a scanner would help me. I already have pictures of my receipts. I might have to do more research because I feel like there's gotta be something out there where you can just show images of receipts and have it generate a csv of the data.

I'd even pay a decent amount to do it. After doing some more research it seems like MS Office might handle this workflow too in Excel (convert receipt picture to csv data).

I have attempted this, and the biggest issue is that receipts sometimes use codes that are hard to understand. And the codes change from store to store.

If you're lucky, you won't need to go to a grocery store and determine what a code means, you will only need to map the code to an actual item you bought.
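A rough Python sketch of that mapping step, assuming you keep one lookup table per store. The store names, codes, and item names below are invented examples, and `normalize_item` / `unmapped_codes` are hypothetical helpers, not from any library.

```python
# Hypothetical per-store lookup tables: receipt code -> item name you recognize.
CODE_MAP = {
    "storeA": {"ORG BNNA": "bananas (organic)", "GV 2% MLK": "milk, 2%"},
    "storeB": {"BANANA ORG": "bananas (organic)"},
}


def _key(code):
    """Normalize case and whitespace so OCR spacing noise does not matter."""
    return " ".join(code.upper().split())


def normalize_item(store, code, code_map=CODE_MAP):
    """Return the mapped item name, or a marked placeholder if unknown."""
    key = _key(code)
    name = code_map.get(store, {}).get(key)
    return name if name is not None else f"UNMAPPED:{key}"


def unmapped_codes(store, codes, code_map=CODE_MAP):
    """List the distinct codes for a store that still need a manual mapping."""
    known = code_map.get(store, {})
    return sorted({_key(c) for c in codes if _key(c) not in known})
```

Running `unmapped_codes` over a whole batch first gives you a short to-do list of codes to map by hand, instead of discovering them one receipt at a time.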

That’s perfectly fine for me. I can map the key items myself, the hard part is I don’t want to devote a solid 120+ hours manually creating the CSVs for 150 receipts.

Is it possible you can discuss more what you did?

ChatGPT Vision will do well with this kind of OCR stuff. Just give it the header and a few example rows to get back consistently formatted output.

Or use JSON mode with the API.
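A small Python sketch of the "header and a few example rows" idea. It only builds the prompt text and checks the reply; the actual call to a vision model is deliberately left out because the thread does not specify it. The column names, example rows, and function names are assumptions for illustration.

```python
import csv
import io

# Assumed CSV layout; pick whatever columns you actually need.
HEADER = ["date", "store", "item", "quantity", "price"]
EXAMPLES = [
    ["2023-01-07", "ExampleMart", "bananas", "1", "1.29"],
    ["2023-01-07", "ExampleMart", "oat milk", "2", "9.98"],
]


def build_prompt(header=HEADER, examples=EXAMPLES):
    """Build the instruction text to send along with the receipt image."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(examples)
    return (
        "Transcribe the attached receipt as CSV. Use exactly this header, "
        "follow the format of the example rows, and output only CSV.\n\n"
        + buf.getvalue()
    )


def parse_reply(reply, header=HEADER):
    """Check that the model's reply matches the header, then return dict rows."""
    rows = [r for r in csv.reader(io.StringIO(reply.strip())) if r]
    if not rows or rows[0] != header:
        raise ValueError("reply does not start with the expected header")
    for row in rows[1:]:
        if len(row) != len(header):
            raise ValueError(f"row has wrong number of columns: {row}")
    return [dict(zip(header, row)) for row in rows[1:]]
```

Validating every reply against the header is what makes the output usable across 150 receipts: a reply that drifts from the format fails loudly and can be retried.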

I'd love it if Apple (since they appear not to sell data) would provide an anonymized receipt analysis service like this.

Huge user base right off the bat.

I wish there was just a standardized receipt sending format that would send your receipt to your pay app once you tap to pay something.

Of course, I can only imagine this coming to fruition if it is packed to the brim with tracking and 3rd party dissemination.

Yes, just awkwardly punch in your email address into this miscalibrated POS keypad so we can “only send you your receipt”*

* May also enroll you in “helpful” store newsletters and sell your email to every company on your grocery store receipt.

To the brim, for sure.

Probably one day we'll have a "citizen preferences file" where the confidentiality of our interactions with various entities can be granularly set.

And who is the business consumer
Hmm, yeah I suppose anything promoting consumer knowledge would be antithetical to ordinary business schemes.

Pie in the sky, alas.

Maybe they could grab some preventative lobby money with the threat of it. /s

I do the same with WholeFoods receipts. The pipeline is:

1. Scan to an FTP dir as TIFF

2. Nightly job submits image(s) from the dir to Veryfi (their free tier is enough and they look like the best for receipt OCR)

3. Save that raw JSON, enhanced JSON (fix occasional mis-attribution for discounts, calculate unit price), and CSV. Filename is a purchase date - correctly extracted by Veryfi.

4. Render with bash + gnuplot.

5. TODO: store into some DB and render with Grafana or something.
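For readers wondering what step 3 might look like, here is a minimal Python sketch. The JSON shape (a `date` plus `line_items` with `description`, `quantity`, `total`) is a made-up stand-in, not Veryfi's actual response schema, and the function names are hypothetical. The discount fix-up from step 3 is not shown.

```python
import csv
import io


def enhance(receipt):
    """Add a unit_price to every line item (missing quantity is treated as 1)."""
    items = []
    for line_item in receipt["line_items"]:
        quantity = line_item.get("quantity") or 1
        unit_price = round(line_item["total"] / quantity, 2)
        items.append({**line_item, "quantity": quantity, "unit_price": unit_price})
    return {**receipt, "line_items": items}


def to_csv(receipt):
    """Flatten an enhanced receipt into CSV text, one row per line item."""
    fields = ["date", "description", "quantity", "total", "unit_price"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for line_item in receipt["line_items"]:
        writer.writerow(
            {
                "date": receipt["date"],
                "description": line_item["description"],
                "quantity": line_item["quantity"],
                "total": line_item["total"],
                "unit_price": line_item["unit_price"],
            }
        )
    return buf.getvalue()


def csv_filename(receipt):
    """Name the output file after the purchase date, as in the pipeline above."""
    return f"{receipt['date']}.csv"
```

Keeping the raw JSON next to the enhanced version, as the pipeline does, means the enhancement logic can be rerun later without calling the OCR service again.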

Edit: formatting

I agree with the other poster. Do you have a repo you'd be willing to share?
Github repo? ;)