|
|
|
|
|
by mlthoughts2018
2868 days ago
|
|
I happen to use scipy sparse csc and csr matrices for huge sparse tfidf data at work, but never encountered this (we have a numba utility function for operations we do directly on the data, indices, and indptr internal arrays, including counting). But I do see that counting nnz boils down to a call to np.count_nonzero, which treats bools as np.intp, which is either going to be int32 or int64 (very weird that it chooses signed types), then calls np.sum. The best solution would be to use np.seterr to warn exactly at call sites with int32 overflow, but amazingly, there seems to be an open numpy issue saying that seterr is not guaranteed for sum. I do think seterr + logging would be better for this than roping in static typing everywhere just to get a one off benefit like this. |
|
Quick question, when you create a scipy.csr how do you ensure the subsequent multiplication operator falls back to C code that uses int64 to index the internals and not int32. I thought if indices array was a int64 array it would do the job. I was wrong. Anyway, even if that had worked it would still have fallen short of ensuring. If it worked, it just happened to work -- thats an anecdote.
If one had static typechecks one would not have to read through all the layers to be sure. Compile error, if any, would have told me.
We also cant directly use scipy.sparse because we dont have that much RAM on these machines. We do use scipy.sparse but they operate internally with memory mapped arrays. Now, depending on the platform memory mapped arrays can be limited to an index of 1<<31. So we have to be extra careful what type is used for indexing in the native libraries that these layers are a wrappers over.
BTW its far from a one off benefit. This was just one of the examples fresh in my memory. It directly affects real money. There you dont want to ship code that could have bugs that can cost you. Static types help rule out these cases once for all. With run time checks it is very hard to be sure that you have caught all of the code paths that can have these mismatches.
I agree that in grad school its different :) One can play fast and loose. Even more, if research is not expected to be reproducible -- that would be pure science.