Hacker News new | ask | show | jobs
by mlthoughts2018 2868 days ago
I happen to use scipy sparse csc and csr matrices for huge sparse tfidf data at work, but never encountered this (we have a numba utility function for operations we do directly on the data, indices, and indptr internal arrays, including counting).

But I do see that counting nnz boils down to a call to np.count_nonzero, which treats bools as np.intp, which is either going to be int32 or int64 (very weird that it chooses signed types), then calls np.sum.

The best solution would be to use np.seterr to warn exactly at call sites with int32 overflow, but amazingly, there seems to be an open numpy issue saying that seterr is not guaranteed for sum.

I do think seterr + logging would be better for this than roping in static typing everywhere just to get a one off benefit like this.

1 comments

But thats just Numpy. As I mentioned the logic flows through other components too. I am guessing your nnzs are medium sized and hasnt hit 2 billion yet.

Quick question, when you create a scipy.csr how do you ensure the subsequent multiplication operator falls back to C code that uses int64 to index the internals and not int32. I thought if indices array was a int64 array it would do the job. I was wrong. Anyway, even if that had worked it would still have fallen short of ensuring. If it worked, it just happened to work -- thats an anecdote.

If one had static typechecks one would not have to read through all the layers to be sure. Compile error, if any, would have told me.

We also cant directly use scipy.sparse because we dont have that much RAM on these machines. We do use scipy.sparse but they operate internally with memory mapped arrays. Now, depending on the platform memory mapped arrays can be limited to an index of 1<<31. So we have to be extra careful what type is used for indexing in the native libraries that these layers are a wrappers over.

BTW its far from a one off benefit. This was just one of the examples fresh in my memory. It directly affects real money. There you dont want to ship code that could have bugs that can cost you. Static types help rule out these cases once for all. With run time checks it is very hard to be sure that you have caught all of the code paths that can have these mismatches.

I agree that in grad school its different :) One can play fast and loose. Even more, if research is not expected to be reproducible -- that would be pure science.

Our nnz is certainly far greater than 2 billion. The matrix size is around 150 million rows by around 1.7 million columns. We just accumulate the count with a python integer.

I don’t know what you mean by “that’s just numpy” though — since even if this flows through other systems, tracking it at the source in numpy would be obvious.

“Static types help rule out these cases..” — I just disagree. That is what’s advertised, but it’s just not true. Years of working in Scala for very heavy enterprise production systems has made me realize it’s a very false promise. There are actually remarkably few classes of these errors that are removed by static type enforcement, and perfectly good patterns to deal with it in dynamic type situations.

If static typing was free, then sure, why not. But instead it’s hugely costly and kills a lot of productivity, rather than the promise that it improves productivity over time by accumulating compounding type safety benefits.

I think a good rule of thumb is that anything that causes you to need to write more code will be worse in the long run. There’s no guarantee you’ll actually face fewer future bugs with static typing and visibility noise in the code, but you can guarantee it adds more to your maintenance costs, compile times, and complexity of refactoring.

I guess Python’s gradual typing is a good compromise, since you don’t have to choose between zero type safety or speculative all-in type safety where the maintenance overhead almost always outweighs the benefits (rendering it a huge and unreconcilable form of premature optimization).

You can only add it in those few, rare places where there is demonstrated evidence that the static typing optimization actually has a payoff.

> since even if this flows through other systems, tracking it at the source in numpy would be obvious.

You cant possibly be saying that ! even if one assumes that source is numpy.

Regarding the rest, lets say my experience with Ocaml has been more gratifying than yours with Scala.

> We just accumulate the count with a python integer.

That wont help when you are using scipy.sparse for sparse on sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.

Our differences in tastes aside, you seem to work on interesting stuff. Would love exchanging notes in case we run into each other one day. Should be fun.

> “You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.”

For csc and csr matrices at least, these operations typically iterate the underlying indices, indptr and data arrays, and csc `nonzero` uses len(indices), which both relies on (eventually) the C-level call to malloc that defined `indices` (and so uses the systems address space precision, and would never report number of elements in a lower precision int than what the platform supports for memory addressing), and returns this as an infinite precision Python int. Afterwards it only uses arrays of indices, not integers holding sizes.

Long story short is that at least for csc matrices, the issue you describe wouldn’t be possible internally to scipy’s C operations, as you’d always be dealing with an integer type large enough for any possible contiguous array length that can be requested on that platform (and the nonzero items are stored in contiguous arrays under the hood).

On my team we are not doing pure Python ops on the sparse matrices, rather we needed customized weighted operations (for a sparse search engine representation that weights bigrams, trigrams, trending elements, etc., in customized ways) and some set operations to filter rows out of sparse matrices.

So we basically rip the internal representation (data, indices, and indptr) out of csc matrices and pass them into a toolkit of numba functions that we have spent time optimizing.

Lets not weasel with 'typically'.

The code that will get called for a multiply is this https://github.com/scipy/scipy/blob/master/scipy/sparse/spar... and https://github.com/scipy/scipy/blob/master/scipy/sparse/spar...

It's important that decisions at the python level trickles down to the correct choice when it comes down to this level.

On a 64 bit architecture one would expect that using 64bit int arrays for indices and indptr would ensure that. But thats not the way it works. We regularly encountered cases where it would call the code corresponding to int32. I know why and have special checks and jump hoops to prevent this.

Thats besides the point, with static types I wouldn't need to do this, the compiler would take care of it.

I appreciate your effort to dig through the logic. You have spent time speaking at length in the comment above but unfortunately said little. Malloc has nothing to do with it. Your third paragraph is manifestly false. Why do I say so ? Because I deal with this everyday and have counterexamples.

I didnt mean to ask you to find out. Apologies if I wasted your time. I already know why the type mismatch happens. My point was to demonstrate that a lot of manual wading is needed to ensure that it finally bottoms out by calling native code with correct type.

The code you linked actually seems to refute your claim of this precision error, at least for multiply, because it is using npy_intp for nnz, which will be int64 on a 64 bit platform, and there is even an overflow check below!

Can you post a gist or link some other concrete example to show how it can overflow the intp type based on large nnz? Reading the code, it looks like this could not happen.

(Note that the entire second step function wouldn’t have this problem, because it’s accessing indices inside the other arrays, after nnz has already been computed, and is not looping over a variable that would overflow, apart from nnz from the first function, which I pointed out above seems not to overflow unless you’re compiling things in a non-standard way that affects npy_intp).

I don’t know what your comments about malloc having nothing to do with it are though. That is how numpy arrays possess their post-allocation result for __len__, such as for indices, indptr and data in csr. So __len__ could not overflow an int type (since it requires the platform address space’s int type to allocate underlying contiguous arrays and returns a Python integer).