| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by neutrinobro 40 days ago
	Hardly seems worth the effort, perhaps things have improved since 2019. It would be interesting to see an updated benchmark, but if your going to end up with code that looks like C++ to get proper performance, you might as well write it in C++. My biggest problem with Julia is that they decided to use column-major indexing for multi-dimensional arrays (i.e. FORTRAN/MATLAB style). This makes interoperability with C/C++ and python numpy a real pain, since you can't do zero-copy array sharing between the two without one side being forced into strided-access. For that reason alone I haven't adopted it in any of my work-flows.

2 comments

adrian_b 40 days ago

Actually the column-major order of Fortran is more efficient for some linear algebra operations than the order of C, which has been inherited by many modern languages that do not care about high performance in scientific computations.

So I would say that the culprit for interoperability is C and its descendants, not Fortran or Julia. The designers of C and of the languages that have imitated C have not given any thought about which order for multi-dimensional arrays is better, so the users of such languages do not have any right to blame for interoperability other languages that have done the right thing. Even if the Fortran order had not been better, it had already been used for 20 years before C, so there was no reason to choose a different order.

C has chosen to store arrays in the order in which they are typically read by humans when written on paper, but this is a choice like the choice between big-endian and little-endian, where big-endian was how Europeans wrote numbers, but little-endian is more efficient on computers.

An example of why column-major order is preferable, is the matrix-vector product, i.e. the evaluation of a function that maps linear spaces.

The matrix-vector product should not be done as it is typically taught in schools, by scalar products of rows of the matrix with the vector, because this is less efficient, by making more memory accesses. The right way to compute a matrix-vector product is by doing AXPY operations between columns of the matrix and the vector operand (segments of the output of the AXPY operations are held in registers until all partial AXPY operations are accumulated, avoiding memory accesses). In this case, you need to read columns of the input matrix for each AXPY operation, which is much more efficient when the elements of a column are stored compactly in memory, avoiding the need of strided accesses.

The same thing happens for matrix-matrix products, which must not be done in the naive way taught in schools, by scalar products of rows of the first matrix with columns of the second matrix, but it must be done by tensor products of columns of the first matrix with rows of the second matrix.

link

ilayn 39 days ago

> Actually the column-major order of Fortran is more efficient for some linear algebra operations than the order of C, which has been inherited by many modern languages that do not care about high performance in scientific computations.

This is a plausible assumption to make but unfortunately it is not true at large. Especially when the traditional sizes are exceeded say n >= 2000 certain operations such as LU can be improved in terms of performance with C-major arrays. However the correct statement is you lose at some place you win at other. There are certainly linalg operations that F-major can give you more performance. However it is also true for C-major layout.

In your example matrix vector product or any BLAS2 or BLAS3 level operations you can also swap out the for loop order to convert things around (row*col buffer multiplication vs sum of weighted column sum interpretation). In particular matrix norm operations are the only exceptions (abs column sum, row abs sum etc.) that certain norms prefer certain orders. In fact if you go into the Goto method deep enough you'll see that internal order is a bit like Morton ordering to fit things into L1 Cache.

The reason why column-major is preferred is historical and requires more surgery to get it running with C-major ordering. Trust me I tried but it's too much work to gain not so much. Maybe someday when I retire I can attempt it. Hence I kept it column major in my retranslation of LAPACK https://github.com/ilayn/semicolon-lapack

Instead I implemented a "high"-performance AVX2 matrix transpose operation so that swapping the memory layout is trivial compared to the linalg cost.

link

csvance 40 days ago

Just reverse the axis on one side, typically the Julia side. This is the convention used in Lux.jl/Flux.jl. I share memory between the two with zero additional copying for my workflows on a daily basis. If you are really allergic to doing this, I’m sure it’s possible to use metaprogramming / the type system to write it the same way in both places with zero performance overhead.

link