Hacker News new | ask | show | jobs
by yodelshady 1918 days ago
Open question: why are multi-dimensional arrays, and matrices specifically, so neglected in almost every other language?

They map well to practically freakin' everything, for what seems like.. not that much effort on the language design side, but an enormous amount of tedious, duplicated effort on the user side.

6 comments

That is a great question, and while I don’t know for sure, I think a relevant anecdotal observation that stands out to me is that many languages which do have first-class multidimensional arrays also have one-based indexing. And I would speculate this in turn is because most people who wanted matrices badly enough to make them a language feature wanted to do linear algebra with them, where you generally also want one-based notation since that is how all the equations in textbooks and papers are written.

So a language with multidimensional arrays is in a lose-lose position of having to choose to either satisfy the linear algebraists at the cost of alienating general-purpose programmers who want to do pointer arithmetic, or else satisfy the latter while alienating the core demographic for multidimensional numeric arrays.

Personally, I’m fine with (or even slightly prefer) one-based for my own scientific computing, despite starting with C, since it really is more elegant for linear algebra, and I have never found myself needing or wanting to do pointer arithmetic in a language that does have good multidimensional arrays — but clearly it is still a major turn-off to many others.

I don't get why a language can't have both 0 and 1 based indexing, just set it as a compiler flag or something. They are a homomorphism, it's not hard to add or subtract one.
That sounds like it belongs in the top 10% of the worst (programming-related) things I could wish upon another programmer: Having to maintain projects where you have to read the build system's configuration and keep it in mental context just to figure out if an array access is mathematically correct or not.

I would imagine this support to be stomachable if it was active at all times but using a visually different syntax for 0-based and 1-based indexing, maybe using something like the guillemet for the latter (like this: «1») that is wild enough to draw the user's attention when reading the code, and not common/prominent enough across keyboard layouts to make most users think twice before using it when writing code unless it makes sense for the domain.

It's hard enough to understand a codebase, but if a language was like Haskell for instance (you can enable certain extensions in Haskell with a flag at the top of a file), you could have a line at the top of a file defining that this file is 0 based, while maybe the default is 1 based for the language. It doesn't seem like that much more than you already have to consider in many languages.
Funny you put it like that, because the fragmentation regarding language extensions is frequently discussed as one of the biggest issues affecting real world Haskell.
It's mostly talked about by people that don't use Haskell! In practice it's not really a problem.
You can actually switch starting indices fairly easily in Fortran [1], or for that matter Julia, but for some reason the option doesn’t seem to be particularly widely used. People seem to give a lot of weight to the default, even when the effort to change is minimal. Added complexity/ambiguity?

[1] https://stackoverflow.com/questions/48562873/zero-indexed-ar...

Why can't each array have its own indexing?

You could declare the array as array[0..100] of something resp. array[1..100] of something for 0/1 based, and just as well do array[2..100] of something, or array[-1..100] of something, for -1/2 based.

Or with letters array['a'..'z'] of something,

Or with enums. TEnum = (red, blue, green), and then an array array[red..green] of something

Actually, that is just Pascal

This would all be fine until you start passing and returning these arrays and indexes between independently implemented functions and modules. Then you'd have to start declaring an index type tied to a specific array or a way to look up the index base for a passed in array.
Which would be fine actually for the program itself, if this is all figured out by the compiler/at runtime, though obviously more inefficient "for no good reason".

It would wreak havoc on the programming side though. Have you ever worked on a code base with more than a few people? Have you ever had to deal with different preferences programmers have? Differences in opinion between smart, slightly (or not so slightly) non-social people?

Using 0 or 1 indexed arrays, the new space vs tabs war! Or the new 2 spaces vs. 4 spaces vs. 8 spaces vs. using tabs war!

Ada lets you use any integer subtype as an array range. The first index can start anywhere you wish. It's string library defaults to 1-based but you can easily interoperate with strings using a different subtype. Code can be written that is base agnostic by using the attribute mechanism or by aliasing arrays to use a known range.
I remember some version of BASIC from the early 80's. You could create real multi-denominational arrays including sparse ones with any starting index(s) you wanted.

Matrix operations were first class.

I curious, can you remember which ones?
I think it was a SDS Basic or a variant.

https://en.wikipedia.org/wiki/SDS_BASIC#MAT_commands

OK, thanks. I've not used that. The problem I had was that I learned Fortran before coming across BASIC, so converting Fortran programs was often a problem when features where missing.
Visual basic has that. Pascal lets you specify any index range.
My dumb guess is that whereas Fortran was kinda designed with scientific computation as a first class use case and c/c++ let you do anything with memory (and make it easy to use inline assembly to access specific instructions), most other languages are designed as general purpose language without scientific (as in "high performance") computing as a main design objective.

And thus specialized data type, operators and syntax look like overhead. And thus language designers leave it out.

One interesting wrinkle is that traditionally (dating back to FORTRAN IV at least), Fortran compilers store matrices in RAM address space using column-major order, where consecutive elements are in the same column, not the same row. [1]

Most languages that prioritize Fortran code interop also adopt column-major order, but most other languages that support multidimensional arrays do row-major order. I'm not sure why Fortran went column-major but because it did, a lot of libraries designed for Fortran callers (such as LAPACK and all BLAS implementations) need to be told that input arrays have been transposed when they come from languages like C++.

[1] https://en.wikipedia.org/wiki/Row-_and_column-major_order

That’s one reason why APL and its descendants are so powerful in the hands of people who have become fluent in viewing computation through the lens of arrays.
Because the only place you really need them is in mathematical/scientific computing, and while many scientists tend to pick up a little programming for their research, the vast majority of computer scientists and programmers, in particular the ones with the interest and ability to work on languages, are not concerned with mathematical modeling/processing. It is a fairly rare interdisciplinary combination.
Plenty of DSA work has more optimal representation and evaluation as matrix arithmetic. Particularly graph traversals and transforms which are heavily used in optimizing compilers.
I think you just proved the parent's point: Most programmers working on 'business software' do not need this as a built-in feature of the language.
This is a great question and I hope you get an answer to this.

The number of woeful ad-hoc solutions I've seen to people handling matrix data in Java/C# for what should've been otherwise very basic analysis ...

Something with pandas-like capability in a lower level language would be amazing.

Something with pandas-like capability in a lower level language would be amazing.

More like numpy than pandas but I've used x-tensor (c++) with great success in several projects: https://xtensor.readthedocs.io/en/latest/index.html

Every time I was forced to pandas I miss R data.frame and data.table.
Java has faster multi dimensional array than numpy, e.g nd4j can trivially leverage Intel MKL or even offload to CUDA.
C# has multidimensional arrays. It's not Pandas or numpy, but it's something.