Hacker News
by alokv28 4596 days ago
I'm trying to figure out where something like this fits into the Python data ecosystem.

For datasets that fit in memory, Pandas seems like the best bet: good I/O functions (JSON, CSV), easy slicing (NumPy array-like syntax), and some SQL-like operations (groupby, join).

For large datasets, you'd need a proper db.

So is Dataset then useful for datasets that cannot fit in memory but aren't too large?

3 comments

Simple.

Recently, I was tasked with mapping all of our clients' addresses to lat/long. I could have read the CSV and appended the results to each line, or used a JSON file, but either way I would have had to read and rewrite the whole file every time.

Instead, I wrote a pseudo-helper to dump all the CSV data into a SQLite DB. Then I ran my script. Every time I found a lat/long, I could mark the client as "done" and add the lat/long for that client and every other client that shared the same address. When I had to stop my script because I saw that one result from Google Maps was wrong, I could just edit it straight in SQL, mark it as "invalid", and relaunch my script: it started right back at the first undone row. Then I just had to select all the "invalid" results and either search for them manually or refine them so Google Maps would give me a proper result.
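The workflow above could look roughly like this in Python with the standard-library sqlite3 module. This is only a sketch under assumptions: the table layout, the sample rows, and `fake_geocode` (a stand-in for a real geocoding call) are all hypothetical, and here a row is marked "invalid" automatically when no result comes back, whereas in my case I flagged bad results by hand.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real client CSV.
SAMPLE_CSV = """name,address
Alice,1 Main St
Bob,2 Oak Ave
Carol,1 Main St
"""


def load_clients(conn, csv_text):
    """Dump the CSV rows into a table; every row starts out as 'todo'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients ("
        "name TEXT, address TEXT, lat REAL, lng REAL, "
        "status TEXT NOT NULL DEFAULT 'todo')"
    )
    rows = [(r["name"], r["address"]) for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO clients (name, address) VALUES (?, ?)", rows)
    conn.commit()


def fake_geocode(address):
    """Placeholder for a real geocoder: returns (lat, lng) or None."""
    known = {"1 Main St": (40.0, -75.0)}
    return known.get(address)


def run(conn, geocode):
    """Process rows still marked 'todo'; safe to interrupt and relaunch."""
    while True:
        row = conn.execute(
            "SELECT address FROM clients WHERE status = 'todo' LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to do
        address = row[0]
        result = geocode(address)
        if result is None:
            conn.execute(
                "UPDATE clients SET status = 'invalid' "
                "WHERE address = ? AND status = 'todo'",
                (address,),
            )
        else:
            # Update every client sharing this address in one go.
            conn.execute(
                "UPDATE clients SET lat = ?, lng = ?, status = 'done' "
                "WHERE address = ? AND status = 'todo'",
                (result[0], result[1], address),
            )
        conn.commit()  # progress survives a crash or Ctrl-C


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
load_clients(conn, SAMPLE_CSV)
run(conn, fake_geocode)
invalid = conn.execute(
    "SELECT name FROM clients WHERE status = 'invalid'"
).fetchall()
```

Because each step commits, relaunching `run` simply picks up at the first row still marked "todo", and the "invalid" rows can be selected afterwards for manual review.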

Dataset is useful for small data that is constantly being worked on.

(This answer is from a Ruby POV, and the dataset I was working on had about 4K rows, which explains why a) some Python magic wasn't available to me, though maybe it would have been perfect in the Python world, and b) I didn't want to play with streams on my files.)

Of course, I still need some automation to use my "DataMiner" (as I called it) to the fullest. I'll use Dataset's API as a basis to rewrite it correctly.

I know very little about what's available in Ruby, but I would have used the Pandas library to accomplish this task in Python. Its in-memory data structure, the DataFrame, is more than capable of handling those operations.

I think it's for persistence. If you want it to be robust, there's a lot more to storing mutable data on disk than reading and writing JSON, CSV, or pickle files. SQLite is great for that sort of thing.
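As one small illustration of the robustness point, here is a sketch (table and values are made up) of what SQLite gives you for free: an update that fails halfway is rolled back instead of leaving a half-written file. It relies on the standard-library sqlite3 connection acting as a context manager that commits on success and rolls back on an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would behave the same way
conn.execute("CREATE TABLE clients (name TEXT, status TEXT)")
conn.execute("INSERT INTO clients VALUES ('Alice', 'todo'), ('Bob', 'todo')")
conn.commit()

try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE clients SET status = 'done' WHERE name = 'Alice'")
        raise RuntimeError("simulated crash in the middle of an update")
except RuntimeError:
    pass

# The partial update was rolled back, so both rows are unchanged.
statuses = conn.execute(
    "SELECT name, status FROM clients ORDER BY name"
).fetchall()
```

Getting the same guarantee with a JSON or pickle file means writing to a temp file, syncing, and renaming it yourself.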

Also, it looks like it is a proper DB access layer: point it at Postgres or something, take away its ALTER and CREATE permissions, and you're good to go.

pickle? SQLite?