Hacker News
by peatmoss 4192 days ago
> large datasets on disk

I saw this benchmark a while back comparing Pandas to SQLite in-memory databases. While Pandas did edge out SQLite in several areas, it was by well under an order of magnitude: http://wesmckinney.com/blog/?p=414

Pretty solid performance plus the ability to work with large datasets on disk seemed like a pretty big win to me. I could imagine a set of SQLite extensions (a la spatialite) that could further optimize for various data.frame use cases. As an added bonus, the same libraries would be very portable between different languages--even languages that don't currently have something like dataframes.

EDIT: What I don't know about is memory efficiency. Perhaps SQLite isn't memory-efficient, but I wouldn't bet against it.

3 comments

I personally switched from Pandas to SQL. While Postgres is a heavy-duty database for large production operations, it is fully capable of doing day-to-day analysis of CSV files with nice SQL syntax.

There were two reasons for the switch. The first is that SQL syntax is cleaner and better understood by others. The second is that if you get a dataset bigger than memory, you aren't stuck.
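
To make the "bigger than memory" point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Postgres so that it runs without a server, which is a swap on my part. The table name, column names, and sample data are all made up for illustration. Rows are streamed into an on-disk database in batches and the aggregation is done in SQL, so the full dataset never has to sit in a Python list.

```python
import csv
import io
import itertools
import os
import sqlite3
import tempfile

# Hypothetical sample data standing in for a large CSV file on disk.
SAMPLE_CSV = "category,value\na,1.0\nb,10.0\na,3.0\nb,20.0\n"


def load_rows(conn, table, rows, batch_size=1000):
    """Stream (category, value) rows into `table` one batch at a time.

    `table` is interpolated into the SQL, so it must be a trusted identifier.
    """
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (category TEXT, value REAL)")
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", batch)
    conn.commit()


def mean_by_category(conn, table):
    """Return [(category, mean value), ...] computed inside SQLite."""
    cur = conn.execute(
        f"SELECT category, AVG(value) FROM {table} "
        "GROUP BY category ORDER BY category"
    )
    return cur.fetchall()


def demo():
    reader = csv.reader(io.StringIO(SAMPLE_CSV))
    next(reader)  # skip the header row
    with tempfile.TemporaryDirectory() as tmp:
        # An on-disk database file, not ":memory:".
        conn = sqlite3.connect(os.path.join(tmp, "analysis.db"))
        try:
            rows = ((cat, float(val)) for cat, val in reader)
            load_rows(conn, "measurements", rows)
            return mean_by_category(conn, "measurements")
        finally:
            conn.close()


if __name__ == "__main__":
    print(demo())
```

With a real file you would pass `csv.reader(open(path, newline=""))` instead of the in-memory string; the batching logic stays the same.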

That benchmark is only for joins? That's a pretty small part of analytic workflows in my experience.
That's fair. Now I'm curious how a more complete set of benchmarks would look using in-memory SQLite, and what the opportunity for extension would be.
Unfortunately the datasets in that benchmark are less than 3MB each, so they fit entirely in cache. That doesn't give a good indication of how well the function/implementation scales on the bigger data sizes that really matter (in terms of computation time, memory, cache efficiency, etc.). How much does one really care about 0.018 vs 0.023 seconds?
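
For anyone who wants to check how the join time grows with size, here is a rough sketch that times one join in an in-memory SQLite database at increasing row counts. It is not a reproduction of the linked benchmark: the table layout, the row counts, and the function names are my own assumptions, and it only covers the SQLite side. A fair comparison would need the equivalent Pandas merge on the same data and several repetitions per size.

```python
import sqlite3
import time


def build_tables(conn, n):
    """Create two n-row tables that share an integer join column `k`."""
    conn.execute("CREATE TABLE left_t (k INTEGER, x REAL)")
    conn.execute("CREATE TABLE right_t (k INTEGER PRIMARY KEY, y REAL)")
    conn.executemany(
        "INSERT INTO left_t VALUES (?, ?)", ((i, float(i)) for i in range(n))
    )
    conn.executemany(
        "INSERT INTO right_t VALUES (?, ?)", ((i, 2.0 * i) for i in range(n))
    )
    conn.commit()


def time_join(n):
    """Return (matched row count, seconds spent on the join query)."""
    conn = sqlite3.connect(":memory:")
    try:
        build_tables(conn, n)
        start = time.perf_counter()
        # SUM touches a column from each table, so both sides must be read.
        matched, _total = conn.execute(
            "SELECT COUNT(*), SUM(l.x + r.y) "
            "FROM left_t AS l JOIN right_t AS r ON l.k = r.k"
        ).fetchone()
        elapsed = time.perf_counter() - start
        return matched, elapsed
    finally:
        conn.close()


if __name__ == "__main__":
    # Arbitrary sizes; the largest is meant to be well past 3MB of data.
    for n in (10_000, 100_000, 1_000_000):
        matched, elapsed = time_join(n)
        print(f"{n:>9} rows: {matched} matches in {elapsed:.4f} s")
```

A single timing per size is noisy, so treat the output as a rough trend rather than a measurement.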