by benjaminwootton
1201 days ago
The benefit of “in process” isn’t really clicking for me. We already had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than the Pandas APIs, but I’m not sure that’s a killer feature? If we have a lot of data and multiple people working with it, then it makes sense to centralise it in a database or warehouse, where it’s easy to access via SQL anyway. We can query files on S3 with it and do the processing locally, but then we pay network latency because compute and storage are further apart. There is a cost benefit, because we don’t need to run a server, and that could be significant. It’s nicely implemented, but I’m not seeing a big gap that it fills?
Lots of users: (a) mentally align better with SQL than with pandas APIs, whether or not they know both; (b) want decent performance on their analyses, which they aren't getting from pandas and won't get from many of the OLTP databases they're using over the wire; (c) want easy access to Parquet and CSV files, locally and remotely, with minimal development overhead. It's super simple with DuckDB.
That said, some of the other things you pointed out are real tradeoffs. We're building a serverless cloud capability at MotherDuck on top of DuckDB to address some of these and to optimize compute and storage based on data locality, bandwidth, and the need for collaboration among multiple people.