by hadley
4192 days ago
I thought that before writing dplyr, but now I see that there are big differences. Relational databases are designed to work with large datasets on disk, and to accept changes very rapidly. The demands for in-memory data analytics are quite different. Columnar data stores are a better fit, but it's pretty easy to bang out efficient code for in-memory data; it's much harder to work with out-of-memory data.
|
I saw this benchmark a while back comparing Pandas to SQLite in-memory databases. While Pandas did edge out SQLite in several areas, it was by well under an order of magnitude: http://wesmckinney.com/blog/?p=414
Pretty solid performance plus the ability to work with large datasets on disk seemed like a pretty big win to me. I could imagine a set of SQLite extensions (a la spatialite) that could further optimize for various data.frame use cases. As an added bonus, the same libraries would be very portable between different languages--even languages that don't currently have something like dataframes.
EDIT: What I don't know about is memory efficiency. Perhaps SQLite isn't memory-efficient, but I wouldn't bet against it.
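To make the idea concrete, here is a minimal sketch of a data-frame-style grouped aggregation run against an in-memory SQLite database, using only Python's standard `sqlite3` module. The `flights` table, its columns, and the sample rows are hypothetical and exist only for illustration; this is not the setup used in the linked benchmark.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; passing a file path instead
# would store the same tables on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this example.
conn.execute("CREATE TABLE flights (carrier TEXT, delay REAL)")
rows = [("AA", 10.0), ("AA", 20.0), ("UA", 5.0), ("UA", 15.0), ("UA", 25.0)]
conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)

# Roughly what a data frame group-by followed by count and mean would compute.
result = conn.execute(
    "SELECT carrier, COUNT(*), AVG(delay) "
    "FROM flights GROUP BY carrier ORDER BY carrier"
).fetchall()
print(result)

conn.close()
```

Swapping `":memory:"` for a filename is the only change needed to run the same query against data on disk, which is the portability argument made above.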