by alokv28
4596 days ago
I'm trying to figure out where something like this fits into the Python data ecosystem. For datasets that fit in memory, Pandas seems like the best bet: good I/O functions (JSON, CSV), easy slicing (NumPy array-like syntax), and some SQL-like operations (groupby, join). For large datasets, you'd need a proper database. So is Dataset then useful for datasets that cannot fit in memory but aren't too large?
|
Recently, I was tasked with mapping all of our clients' addresses to lat/long. I could have read the CSV and appended the results to each line, or used a JSON file, but either way I would have had to read and rewrite the whole file every time.
Instead, I wrote a quick helper to dump all the CSV data into a SQLite DB, then ran my script against it. Every time I found a lat/long, I could mark the client as "done" and add the lat/long for that client and every client that shared this address. When I had to stop my script because I saw that one result from Google Maps was wrong, I could just edit it straight in SQL, mark it as "invalid" and relaunch my script: it picked up right at the first row that wasn't done yet. Then I just had to select all the "invalid" results and either search them manually or refine them so Google Maps would give me a proper result.
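For anyone who wants to try the same workflow, here is a rough sketch of it in Python using only the standard-library `sqlite3` and `csv` modules (not my actual Ruby code, and not Dataset's API). The table layout, the `status` values, and the `fake_geocode` stand-in for the Google Maps call are all assumptions made up for illustration. Marking failed lookups as "invalid" automatically is also a simplification; I did that part by hand in SQL.

```python
import csv
import io
import sqlite3


def load_clients(conn, csv_file):
    """Dump CSV rows (columns: name, address) into a hypothetical clients table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients ("
        "id INTEGER PRIMARY KEY, name TEXT, address TEXT, "
        "lat REAL, lng REAL, status TEXT DEFAULT 'todo')"
    )
    rows = [(r["name"], r["address"]) for r in csv.DictReader(csv_file)]
    conn.executemany("INSERT INTO clients (name, address) VALUES (?, ?)", rows)
    conn.commit()


def geocode_pending(conn, geocode):
    """Process every address still marked 'todo'.

    Progress is committed after each address, so the script can be
    interrupted and relaunched: it resumes at the first row not yet handled.
    """
    while True:
        row = conn.execute(
            "SELECT address FROM clients WHERE status = 'todo' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to do
        address = row[0]
        result = geocode(address)
        if result is None:
            # Assumption: a failed lookup is flagged for manual review.
            conn.execute(
                "UPDATE clients SET status = 'invalid' "
                "WHERE address = ? AND status = 'todo'",
                (address,),
            )
        else:
            lat, lng = result
            # One lookup fills in every client sharing this address.
            conn.execute(
                "UPDATE clients SET lat = ?, lng = ?, status = 'done' "
                "WHERE address = ? AND status = 'todo'",
                (lat, lng, address),
            )
        conn.commit()


def fake_geocode(address):
    """Stand-in for a real geocoding call; returns (lat, lng) or None."""
    known = {"1 Main St": (40.0, -75.0)}
    return known.get(address)


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
sample = io.StringIO(
    "name,address\nAlice,1 Main St\nBob,1 Main St\nCarol,Nowhere\n"
)
load_clients(conn, sample)
geocode_pending(conn, fake_geocode)

statuses = conn.execute(
    "SELECT name, status FROM clients ORDER BY id"
).fetchall()
invalid = conn.execute(
    "SELECT name, address FROM clients WHERE status = 'invalid'"
).fetchall()
print(statuses)
print(invalid)
```

The last query is the "select all the invalid results" step: those rows can be fixed by hand with a plain `UPDATE`, or set back to 'todo' with a refined address before relaunching.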
Dataset is useful for small data that is constantly being worked on.
(This answer is from a Ruby POV, and the dataset I was working on had about 4K rows. That explains why a) some Python magic wasn't available to me, though it might have been perfect in the Python world, and b) I didn't want to play with streams on my files.)
Of course, I still need some automation to use my "DataMiner" (as I called it) to the fullest. I'll use Dataset's API as a basis to rewrite it properly.