by srean
2863 days ago
> since even if this flows through other systems, tracking it at the source in numpy would be obvious.

You can't possibly be saying that! Even if one assumes that the source is numpy. Regarding the rest, let's say my experience with OCaml has been more gratifying than yours with Scala.

> We just accumulate the count with a python integer.

That won't help when you are using scipy.sparse for sparse-on-sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure Python.

Our differences in tastes aside, you seem to work on interesting stuff. Would love to exchange notes in case we run into each other one day. Should be fun.
|
For csc and csr matrices at least, these operations typically iterate over the underlying indices, indptr and data arrays. csc `nonzero` uses len(indices), which relies (eventually) on the C-level call to malloc that allocated `indices`. So it uses the system's address-space precision and would never report the number of elements in a lower-precision int than the platform supports for memory addressing, and it returns the result as an arbitrary-precision Python int. Afterwards it only uses arrays of indices, not integers holding sizes.
Long story short: at least for csc matrices, the issue you describe wouldn’t be possible inside scipy’s C operations, as you’d always be dealing with an integer type large enough for any contiguous array length that can be requested on that platform (and the nonzero items are stored in contiguous arrays under the hood).
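To make the integer-width point concrete, here is a minimal sketch using only the standard library's ctypes. The count of 3,000,000,000 and the helper name `wrap_int32` are made up for illustration; nothing here is scipy's actual code. It just shows what happens when a count that size is stored in a signed 32-bit C int versus a Python int.

```python
import ctypes

def wrap_int32(n):
    """Return n as a signed 32-bit C int would store it (ctypes does no overflow check)."""
    return ctypes.c_int32(n).value

# Hypothetical nonzero count, larger than 2**31 - 1.
nnz = 3_000_000_000

as_python_int = nnz          # Python ints are arbitrary precision, so nothing is lost
as_int32 = wrap_int32(nnz)   # wraps around to a negative number

# Width of the platform's size type in bytes (8 on a typical 64-bit build).
size_type_bytes = ctypes.sizeof(ctypes.c_ssize_t)

print(as_python_int, as_int32, size_type_bytes)
```

On a 64-bit platform the size type is wide enough to hold the count, which is the situation described above; the 32-bit case is the failure mode being worried about.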
On my team we are not doing pure Python ops on the sparse matrices. Rather, we needed customized weighted operations (for a sparse search engine representation that weights bigrams, trigrams, trending elements, etc., in customized ways) and some set operations to filter rows out of sparse matrices.
So we basically rip the internal representation (data, indices, and indptr) out of csc matrices and pass them into a toolkit of numba functions that we have spent time optimizing.
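As a rough sketch of what working directly on the internal representation looks like (not our actual toolkit): the function name `weighted_column_sums` and the weights are invented for illustration, and plain Python lists stand in for the `data`, `indices` and `indptr` arrays you would pull off a scipy csc matrix. The loop is written in the simple index-based style that one would typically hand to a numba-compiled function.

```python
def weighted_column_sums(data, indices, indptr, row_weights):
    """For each column, sum value * weight-of-its-row over the stored nonzeros."""
    n_cols = len(indptr) - 1
    out = [0.0] * n_cols
    for col in range(n_cols):
        total = 0.0
        # Nonzeros of column `col` live in positions indptr[col] .. indptr[col + 1] - 1.
        for k in range(indptr[col], indptr[col + 1]):
            total += data[k] * row_weights[indices[k]]
        out[col] = total
    return out

# CSC layout of the 3x3 matrix
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 6]]
data = [1.0, 4.0, 5.0, 2.0, 3.0, 6.0]   # stored values, column by column
indices = [0, 2, 2, 0, 1, 2]            # row index of each stored value
indptr = [0, 2, 3, 6]                   # where each column starts in data/indices

row_weights = [1.0, 10.0, 100.0]        # hypothetical per-row weights
sums = weighted_column_sums(data, indices, indptr, row_weights)
nnz = len(indices)                      # a Python int, as discussed above
print(sums, nnz)
```

Note that the only sizes involved are `len(...)` results and entries of `indptr`, so the integer width that matters is the dtype of the index arrays themselves.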