Hacker News new | ask | show | jobs
by dahart 737 days ago
The thing I want most is a more sane and more memorable way to compose non-element-wise operations. There are so many different ways to build views and multiply arrays that I can’t remember them and never know which to use, and have to relearn them every time I use numpy… broadcasting, padding, repeating, slicing, stacking, transposing, outers, inners, dots of all sorts, and half the stack overflow answers lead to the most confusing pickaxe of all: einsum. Am I alone? I love numpy, but every time I reach for it I somehow get stuck for hours on what ought to be really simple indexing problems.
8 comments

When I started out I was basically stumbling around for code that worked. Things got a lot easier for me once I sat down and actually understood broadcasting.

The rules are: 1) scalars always broadcast, 2) if one vector has fewer dimensions, left pad it with 1s and 3) starting from the right, check dimension compatibility, where compatibility means the dimensions are equal or one of them is 1. Example: np.ones((2,3,1)) * np.ones((1,4)) = np.ones((2,3,4))

Once your dimensions are correct, it's a lot easier to reason your way through a problem, similar to how basic dimensional analysis in physics can verify your answer makes some sense.

(I would disable broadcasting if I could, since it has caused way too many silent bugs in my experience. JAX can, but I don't feel like learning another library to do this.)

Once I understood broadcasting, it was a lot easier to practice vectorizing basic algorithms.

The broadcasting doc is surprisingly readable and easy to follow. And the rules are surprisingly simple. The diagrams and examples are excellent. https://numpy.org/doc/stable/user/basics.broadcasting.html

After taking the time to work through that doc and ponder some real-world examples, I went from being very confused by broadcasting to employing intermediate broadcasting techniques in a matter of weeks. Writing out your array dimensions in the same style of their examples (either in a text file or on a notepad) is the key technique IMO:

  Image  (3d array): 256 x 256 x 3
  Scale  (1d array):             3
  Result (3d array): 256 x 256 x 3
And of course with practice you can do it in your head.
Please please don't put that in your head or your notepad. This is what code comments are for!
You definitely should "annotate" arrays with their expected sizes, but if you're doing a lot of broadcasting operations it can get pretty verbose to write out those tables over and over.

That said, yes, you definitely should at least make an attempt to clarify your broadcasting logic if you want to be able to read your own scripts in a month from now, let alone write maintainable production code.

Explicit internal broadcasting (by adding an axis with an explicit `indefinite` size instead of `1`) would be so much simpler to reason about.

Unfortunately there is far too much existing code and python is not type-safe.

You can do this with `np.newaxis` - in the NumPy course I wrote as TA we required students to always be explicit about the axes (also in e.g. sums). It would be nice if you could disable the implicit broadcasting, but as you mention that would break so much code
`np.newaxis` explicitly adds a `1` dimension, not an `indefinite` one.
To be honest einsum is the easiest one. You get fine control on which axis matmul to which. But I wish it can do more than matmul.

The others are just messy shit. Like you got np.abs but no arr.abs, np.unique but no arr.unique. But now you have arr.mean.

Sometimes you got argument name index, sometimes indices, sometimes accept (list, tuple), sometime only tuple.

https://github.com/mcabbott/Tullio.jl is my favorite idea for extending einsum notation to more stuff. Really hope numpy/torch get something comparable!
Yeah, Tullio.jl is a great package. Basically, it is just a macro for generating efficient for loops.

I guess, it might be hard to achieve similar feature in Python without metaprogramming.

+1 been at your talk presenting it at a Retreat. Happy User
Ah, I'm not the creator just an admirer hehe
It gets more comfortable over time, but I remember feeling that way for the first year or three. My wishlist now is for most of numpy to just be a really great einsum implementation, along with a few analogous operations for the rest of the map-reduces numpy accelerates.

I've been writing my own low-level numeric routines lately, so I'm not up-to-date on the latest news, but there have been a few ideas floating around over the last few years about naming your axes and defining operations in terms of those names [0,1,2,3]. That sort of thing looks promising to me, and one of those projects might be a better conceptual fit for you.

[0] https://nlp.seas.harvard.edu/NamedTensor

[1] https://pypi.org/project/named-arrays/

[2] https://pytorch.org/docs/stable/named_tensor.html

[3] https://docs.xarray.dev/en/stable/

I want to second the idea of broadcasting based on named axes/dimensions. I think it's a logical next step in the evolution of the array programming paradigm.

I particularly recommend checking out xarray. It has made my numpy-ish code like 90% shorter and it makes it trivial to juggle six+ dimensional arrays. If your data is on a grid (not shaped like a table/dataframe), I see no downsides to using xarray instead of bare numpy.

ChatGPT is really good at this. Using it to solve numpy and matplotlib problems is worth the cost of the subscription.
What a waste of electricity. All because numpy and pandas did not get their APIs right. My hot take.
I'd say you're both right, based on my limited experience. I can't seem to do much in numpy or pandas without resorting to ChatGPT or a notebook-based assistant. I assume that need will go away with additional experience, but maybe that's being overoptimistic.

I wish ChatGPT had been around when I learned C. It would sure have saved the programmers in my neighboring offices a lot of grief.

Most of the bugs I got on numpy programs came from a variable with a different ndims as expected being broadcast implicitly.

Implicit type casting is considered a mistake in most programming languages; if I were to redesign numpy from scratch I would make all broadcasting explicit.

My solution to these problems is asserting an array's shape often. Does anybody know is there's a tool like mypy or valgrind, but that checks mismatched array shapes rather than types or memory leaks?

numpy's API was carefully designed to match matlab, where most of its early users came from.
Then we are lucky they didn't decide to start their arrays at 1 :-)
indeed...
Is this the dplyR use case in R? Is there dplyPython for Numpy?
You mean the pipe operator from Magrittr %>%, and the R 4.0 built-in operator |>?

Pandas has a .pipe(fn) method, but without the lazy evaluation to enable the R symbol capturing magic, the syntax is pretty clunky and not particularly useful. The closest approximation is method chaining, which at least is more consistently available in Pandas than in Numpy.

If you're talking about Dplyr "verbs" then no, there's nothing quite like that in Python, but it's much less necessary in Pandas or Polars than in R, because the set of standard tools for working with data frames in the bear libraries is much richer than in the R standard library.

What do you mean by: "more memorable way to ..." ?