| HN Mirror



	by hadley 4192 days ago
	I thought that before writing dplyr, but now I see that there are big differences. Relational databases are designed to work with large datasets on disk, and to accept changes very rapidly. The demands of in-memory data analytics are quite different. Columnar data stores are a better fit, but it's pretty easy to bang out efficient code for in-memory data; it's much harder to work with out-of-memory data.


peatmoss 4192 days ago

> large datasets on disk

I saw this benchmark a while back comparing Pandas to SQLite in-memory databases. While Pandas did edge out SQLite in several areas, it was by well under an order of magnitude: http://wesmckinney.com/blog/?p=414

Pretty solid performance plus the ability to work with large datasets on disk seemed like a pretty big win to me. I could imagine a set of SQLite extensions (a la spatialite) that could further optimize for various data.frame use cases. As an added bonus, the same libraries would be very portable between different languages--even languages that don't currently have something like dataframes.

EDIT: What I don't know about is memory efficiency. Perhaps SQLite isn't memory efficient, but I wouldn't bet against it.
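
To make the idea of data.frame-style SQLite extensions concrete, here is a minimal sketch using Python's standard `sqlite3` module, which can register a user-defined aggregate on a connection. A real extension in the spirit of spatialite would more likely be written in C; the `py_median` name, the `obs` table, and the `median_by_group` helper are all hypothetical and only illustrate the shape of the approach.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: collects non-NULL values, returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group, after the last row.
        return statistics.median(self.values) if self.values else None


def median_by_group(rows):
    """rows: iterable of (group, value) pairs. Returns [(group, median), ...]."""
    con = sqlite3.connect(":memory:")
    # Register the aggregate under a hypothetical SQL name, taking one argument.
    con.create_aggregate("py_median", 1, Median)
    con.execute("CREATE TABLE obs (grp TEXT, val REAL)")
    con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    result = con.execute(
        "SELECT grp, py_median(val) FROM obs GROUP BY grp ORDER BY grp"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(median_by_group([("a", 1), ("a", 3), ("a", 10), ("b", 2), ("b", 4)]))
```

Because the aggregate is called from plain SQL, any language with SQLite bindings could use a compiled version of the same function, which is the portability point made above.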


IndianAstronaut 4192 days ago

I personally switched from Pandas to SQL. While Postgres is a heavy-duty database for large production operations, it is fully capable of doing day-to-day analysis of CSV files with nice SQL syntax.

There were two reasons for the switch. The first is that SQL syntax is cleaner and better understood by others. The second is that if you get a dataset bigger than memory, you aren't stuck.
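
As a rough illustration of that workflow, the sketch below loads CSV data into a table and summarizes it with SQL. It uses SQLite through Python's standard library as a stand-in, since running it requires no server; the comment above is about Postgres, where a bulk load would typically go through `COPY` instead. The sample data, the `sales` table, and the helper names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real CSV file.
SAMPLE_CSV = """region,product,amount
north,widget,10.0
north,gadget,5.5
south,widget,7.0
south,widget,3.0
"""


def load_csv(con, table, fileobj):
    """Create `table` from the CSV header and insert every data row as text."""
    reader = csv.reader(fileobj)
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    con.execute(f"CREATE TABLE {table} ({cols})")
    con.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)


def totals_by_region(csv_text=SAMPLE_CSV):
    """Return [(region, total_amount), ...] sorted by region."""
    con = sqlite3.connect(":memory:")
    load_csv(con, "sales", io.StringIO(csv_text))
    rows = con.execute(
        # CSV values arrive as text, so cast before summing.
        "SELECT region, SUM(CAST(amount AS REAL)) AS total "
        "FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for region, total in totals_by_region():
        print(region, total)
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` keeps the table on disk, which is the escape hatch for data that no longer fits in memory.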


hadley 4191 days ago

That benchmark is only for joins? That's a pretty small part of analytic workflows in my experience.


peatmoss 4191 days ago

That's fair. Now I'm curious as to how a more complete set of benchmarks would look using in-memory SQLite, and what the opportunity for extension would be.


arun_sriniv 4186 days ago

Unfortunately, the datasets in that benchmark are less than 3MB each in size, so they fit entirely in cache. That doesn't give a good indication of how well the function/implementation scales to the bigger data sizes that really matter (in terms of computation time, memory, cache efficiency, etc.). How much does one really care about 0.018 vs 0.023 seconds?
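
One way to probe this is to time the same queries at several table sizes rather than at one small size. The sketch below does that against in-memory SQLite using only Python's standard library, and covers a join, a group-by, and a filter rather than joins alone. The table layout, query set, and default sizes are assumptions chosen for illustration, not a reconstruction of the linked benchmark; the sizes would need to be raised well beyond the defaults to get tables that clearly exceed CPU cache.

```python
import sqlite3
import time

# Hypothetical query set: one join, one aggregation, one filter.
QUERIES = {
    "join": (
        "SELECT COUNT(*), SUM(l.x * r.y) "
        "FROM left_t AS l JOIN right_t AS r ON l.k = r.k"
    ),
    "group_by": "SELECT k, AVG(x) FROM left_t GROUP BY k",
    "filter": (
        "SELECT COUNT(*) FROM left_t "
        "WHERE x > 0.5 * (SELECT MAX(x) FROM left_t)"
    ),
}


def build_tables(con, n_rows):
    """Fill left_t with n_rows rows and right_t with about n_rows / 10 keys."""
    n_keys = max(1, n_rows // 10)
    con.execute("CREATE TABLE left_t (k INTEGER, x REAL)")
    con.execute("CREATE TABLE right_t (k INTEGER PRIMARY KEY, y REAL)")
    con.executemany(
        "INSERT INTO left_t VALUES (?, ?)",
        ((i % n_keys, float(i)) for i in range(n_rows)),
    )
    con.executemany(
        "INSERT INTO right_t VALUES (?, ?)",
        ((k, float(k) / 2.0) for k in range(n_keys)),
    )
    con.commit()


def time_query(con, sql, repeats=3):
    """Best wall-clock time in seconds over `repeats` runs, fetching all rows."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        con.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmark(sizes=(1_000, 10_000, 100_000)):
    """Return {n_rows: {query_name: seconds}} for each table size."""
    results = {}
    for n_rows in sizes:
        con = sqlite3.connect(":memory:")
        build_tables(con, n_rows)
        results[n_rows] = {
            name: time_query(con, sql) for name, sql in QUERIES.items()
        }
        con.close()
    return results


if __name__ == "__main__":
    for n_rows, timings in run_benchmark().items():
        line = ", ".join(f"{name}={secs:.4f}s" for name, secs in timings.items())
        print(f"{n_rows:>9} rows: {line}")
```

Comparing how each timing grows from one size to the next says more about scaling than any single sub-second measurement does.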
