by T-R 1877 days ago

Data science does a lot of SQL-like and linear-algebra-like transformations over a lot of data, and needs them to be reasonably performant. This means you want to minimize the overhead of indexing into data, and to use SIMD instructions or GPUs, or parallelize the work. To do this, you generally want your data in column-major format: organized as objects of arrays, rather than arrays of objects. Dataframe libraries like Pandas (which uses optimized linear algebra libraries like BLAS/LAPACK under the hood, via numpy) and the Spark Dataframe API are for working with columnar data and getting performance via SIMD or parallelization, respectively.

Generally people start off by doing these computations in a series of batch jobs (an "ETL pipeline", orchestrated with something like Airflow), to transform data into whatever shape they ultimately want it in; streaming technologies like Spark Streaming and Kafka can help with incrementally adding new rows to your data, rather than recomputing the whole thing every batch-job run.

Whenever you want to involve multiple systems or multiple libraries in your dataframe transformations, there's potentially a lot of computational overhead in serializing the dataframes or just converting them between memory representations. Arrow is a standardized format, spearheaded by the person who wrote Pandas, that attempts to match the in-memory representation, so that whether you're passing the data between libraries in-memory or writing a file for some other system to read, no unnecessary transformations need to happen to work on the data.
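
To make the serialization-overhead point concrete, here is a small standard-library analogy. It is not Arrow's API; `array` and `memoryview` just stand in for "two consumers agreeing on one in-memory layout", and the `prices` column is made up for illustration.

```python
import json
from array import array

# A hypothetical column of prices in one contiguous buffer of doubles.
prices = array("d", [2.5, 10.0, 0.5])

# Path 1: serialize, then parse. The consumer ends up with an
# independent copy that had to be converted element by element.
round_tripped = json.loads(json.dumps(list(prices)))

# Path 2: share the buffer. memoryview exposes the same memory
# through the buffer protocol; nothing is converted or copied.
shared = memoryview(prices)

# Mutating the original shows which consumer sees the same memory.
prices[0] = 99.0

copy_first = round_tripped[0]   # still the old value
shared_first = shared[0]        # reflects the update
```

The second path only works because both sides agree on the layout (contiguous 8-byte floats). A standardized columnar format plays that role across libraries and processes.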

6gvONxR4sf7o 1877 days ago

> linear-algebra-like transformations

> To do this, you generally want your data in column-major format

I'd argue that the basic element of linear algebra is matrix-vector multiplication, which I figured was best done row-major. Column-major is great in other data use cases, but 'linear-algebra-like, therefore column-major' doesn't feel right.

BenoitP 1877 days ago

I don't know about linear algebra, but column major lets you compress thus:

* Dictionary encoding: US,US,US,US,FR -> US:0,FR:1;0,0,0,0,1

* Run-length encoding: 0,0,0,0,1 -> 4x0,1x1

* Delta encoding: 0,1,2,3,4 -> 0, then 4x'+1'

* Storing the min and max for a chunk

Basically: exploit the data type to compress it.

This enables very fast filtering and projections. (And now that the IO bottleneck has been managed, you can do your gigantic logistic regression.)
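
A minimal sketch of the four encodings listed above, in plain Python. The function names and the chunk size are assumptions made for illustration; real columnar formats implement these at the byte level with many more details.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for v in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def run_length_encode(values):
    """Collapse runs of equal values into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def chunk_min_max(values, chunk_size):
    """Per-chunk (min, max) pairs, so a filter can skip chunks that cannot match."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]

countries = ["US", "US", "US", "US", "FR"]
dictionary, codes = dictionary_encode(countries)  # {"US": 0, "FR": 1}, [0, 0, 0, 0, 1]
runs = run_length_encode(codes)                   # [(4, 0), (1, 1)]
deltas = delta_encode([0, 1, 2, 3, 4])            # [0, 1, 1, 1, 1]
stats = chunk_min_max([3, 9, 4, 20, 25, 21], chunk_size=3)  # [(3, 9), (20, 25)]
```

With the min/max stats, a filter like `value > 15` can skip the first chunk without reading it. All of these work well only because a column holds values of one type that tend to resemble their neighbours.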

dmlorenzetti 1877 days ago

It sounds like you're thinking about the mat-vec operation in terms of "Grab one row of the matrix, take the dot-product with the vector, and repeat for each row of the matrix."

But it's also possible to think of it as "Grab one element of the vector, use it to scale the corresponding column of the matrix, and repeat, summing results." Both are efficient means of finding the result, and both have block-level versions that play nicely with the machine cache.
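
The two views can be sketched in plain Python as follows. The function names and the small example matrix are made up for illustration, and an optimized library would use blocked, vectorized kernels rather than loops like these.

```python
def matvec_by_rows(rows, vector):
    """Row view: one dot product per row of the matrix."""
    return [sum(a * x for a, x in zip(row, vector)) for row in rows]

def matvec_by_columns(columns, vector):
    """Column view: scale each column by its vector element and accumulate."""
    result = [0] * len(columns[0])
    for column, x in zip(columns, vector):
        for i, a in enumerate(column):
            result[i] += x * a
    return result

# A 3x2 matrix stored row by row...
rows = [[1, 2],
        [3, 4],
        [5, 6]]
# ...and the same matrix stored column by column.
columns = list(zip(*rows))  # [(1, 3, 5), (2, 4, 6)]

vector = [10, 1]

by_rows = matvec_by_rows(rows, vector)
by_columns = matvec_by_columns(columns, vector)
```

Both calls compute the same product; each one just walks whichever layout it was given in contiguous order.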

Meanwhile, linear algebra also often involves finding vector norms, and scaling vectors, and so on, and the way we usually set up tables means that the vectors of interest are generally columns of the data tables.

T-R 1877 days ago

This is what I was trying to get at - using column vectors gives good cache locality and lets you use SIMD for "multiply all of these by this scalar" for each column, and then for "sum all of these" for the resulting rows. I'd imagine it could also let you optimize multiplications into things like bit-shifts with minimal overhead as well, though I have no idea if that's done in practice. Maybe only tangentially related, but I feel like this talk on Halide[0] is really illustrative of the general concepts.

As others have mentioned, for some operations it can also save you from loading whole columns that aren't relevant for your transformation. The compression point in the sibling comment is definitely also relevant, especially for serialization. A whole lot of reasons to use column vectors.

Using "column-major" here might've been terminology abuse; sorry for the confusion.

[0] https://www.youtube.com/watch?v=3uiEyEKji0M

andylei 1877 days ago

"column" here refers to a type of data. let's say you have a bunch of records of purchases. one column would be price, another column would be quantity.

if you're doing a linear-algebra-like transformation, you want to do it on all the prices or all the quantities, and a linear algebra library expects a big array of numbers, which is why you have to transform your records into an array of prices and an array of quantities.

"column" here refers to properties of objects, and not rows vs columns within an array of numbers.

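
The purchase-records example can be sketched like this. The sample values are invented, and the standard-library `array` type stands in for whatever numeric array a linear algebra library would actually expect.

```python
from array import array

# Array of objects: one record per purchase (values are made up).
purchases = [
    {"price": 2.5, "quantity": 4},
    {"price": 10.0, "quantity": 1},
    {"price": 0.5, "quantity": 6},
]

# Object of arrays: one contiguous, homogeneously typed array per property.
prices = array("d", (p["price"] for p in purchases))        # 64-bit floats
quantities = array("q", (p["quantity"] for p in purchases)) # 64-bit signed ints

# A linear-algebra-like operation over whole columns:
# total revenue is the dot product of the two columns.
revenue = sum(p * q for p, q in zip(prices, quantities))
```

Here `revenue` comes out to 23.0 (10.0 + 10.0 + 3.0), and the computation never touches a per-record dictionary once the columns exist.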