|
|
|
|
|
by T-R
1877 days ago
|
|
Data science does a lot of SQL-like and linear-algebra-like transformations over a lot of data, and needs it to be reasonably performant. This means you want to do things like minimize overhead of indexing into data, and use things like SIMD instructions/GPU or parallelize work. To do this, you generally want your data in column-major format - organized as objects of arrays, rather than arrays of objects. Dataframe libraries like Pandas (which uses optimized linear algebra libraries like BLAS/LAPACK under the hood, via numpy) and the Spark Dataframe API are for working with columnar data and getting performance via SIMD or parallelization, respectively. Generally people start off by doing these computations in a series of batch jobs (an "ETL pipeline", orchestrated with something like Airflow), to transform data into whatever shape they ultimately want it in; streaming technologies like Spark Streaming and Kafka can help with incrementally adding new rows to your data, rather than recomputing the whole thing every batch-job run. Whenever you want to involve multiple systems or multiple libraries in your dataframe transformations, there's potentially a lot of computational overhead in serializing the dataframes or just converting them between memory representations. Arrow is a standardized format, spearheaded by the person who wrote Pandas, that attempts to match the in-memory representation, so that whether you're passing the data between libraries in-memory or writing a file for some other system to read, no unnecessary transformations need to happen to work on the data. |
|
> To do this, you generally want your data in column-major format
I'd argue that the basic element of linear algebra is matrix vector multiplication, which I figured was best done row-major. Column major is great in other data use cases, but 'linear-algebra-like, therefore column major' doesn't feel right.