by mlthoughts2018
2866 days ago
The code you linked actually seems to refute your claim of this precision error, at least for multiply, because it uses npy_intp for nnz, which will be int64 on a 64-bit platform, and there is even an overflow check below! Can you post a gist or link some other concrete example showing how it can overflow the intp type for large nnz? Reading the code, it looks like this could not happen. (Note that the entire second step function wouldn’t have this problem, because it accesses indices inside the other arrays after nnz has already been computed, and does not loop over a variable that could overflow, apart from nnz from the first function, which, as I pointed out above, seems not to overflow unless you’re compiling things in a non-standard way that affects npy_intp.) I don’t know what your comment about malloc having nothing to do with it means, though. malloc is how numpy arrays come to have their post-allocation result for __len__, such as for indices, indptr and data in csr. So __len__ could not overflow an int type (since allocating the underlying contiguous arrays requires the platform address space’s int type, and __len__ returns a Python integer).
I mentioned earlier that this is not about precision but about index space. I don't think it's going to be a productive use of my time to continue this thread.
One specific reason you are getting confused is that you are looking at function calls in isolation, not at the entire chain of calls through the different Python ecosystem libraries. The problem is that the indptr and indices arrays, which begin their life as int64 arrays, get transformed into int32 arrays in specific code paths.
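To make the failure mode concrete, here is a minimal stdlib-only sketch of what an unchecked int64 to int32 narrowing does to an indptr-like array. It is not the scipy code path; the helper names (narrow_to_int32, checked_narrow_to_int32) and the indptr values are made up for illustration, and ctypes.c_int32 just stands in for an unchecked C cast.

```python
import ctypes

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def narrow_to_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value, the way an
    unchecked C cast would (ctypes does no overflow checking)."""
    return ctypes.c_int32(value).value


def checked_narrow_to_int32(value: int) -> int:
    """Hypothetical guarded alternative: refuse to narrow instead of wrapping."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return value


# Hypothetical indptr for a two-row matrix whose cumulative nnz crosses 2**31.
indptr64 = [0, 2_000_000_000, 3_000_000_000]

# Silent narrowing: the last entry wraps around to a negative number.
indptr32 = [narrow_to_int32(v) for v in indptr64]

# Row 1 now appears to have a negative number of stored entries.
row1_len = indptr32[2] - indptr32[1]
```

Nothing is raised at the point of narrowing; the damage only shows up later, wherever the wrapped value is used as an offset.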
By stating that malloc is not relevant I mean it's not relevant in this particular instance. By the time control reaches malloc, the type mismatch damage has already taken place.
Getting a runtime error is far from the end of the matter, even if in certain cases we do get runtime errors. What static types save the user from is the hunting needed to find out where in the chain of functions we are losing the type invariant we need.
Stopping such bugs is a no-brainer with static types. You claimed at one point upstream [0] that type systems cannot rule out such errors. If you believe that, this discussion is a waste of time. That's one of the lowest forms of error a type system prevents. Comments like these make me doubt your grasp of these things.
BTW, not sure if you believe large rows and cols imply large nnz [1]. That's not how sparse works.
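A small sketch of the distinction, in plain Python lists laid out CSR-style. The shape, values and the needs_int64 helper are all hypothetical, chosen only to show that the required index width depends on the values stored in the index arrays and not just on how many entries there are.

```python
INT32_MAX = 2**31 - 1

# Hypothetical 2 x 2**40 matrix with only three stored values.
shape = (2, 2**40)
indptr = [0, 2, 3]                 # row 0 holds 2 entries, row 1 holds 1
indices = [5, 2**40 - 1, 2**35]    # column index of each stored value
data = [1.0, 2.0, 3.0]

nnz = indptr[-1]  # 3: tiny, despite the enormous column count


def needs_int64(shape, indptr, indices):
    """Hypothetical check: True if any dimension, offset or column index
    falls outside the non-negative int32 range."""
    largest = max(max(shape), indptr[-1], max(indices, default=0))
    return largest > INT32_MAX


wide_needed = needs_int64(shape, indptr, indices)           # True
small_needed = needs_int64((2, 10), [0, 2, 3], [5, 9, 1])   # False
```

So nnz is 3 here, yet two of the three column indices cannot be represented in int32.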
Given your handle I would have expected you to be familiar with this; it is bread-and-butter stuff in day-to-day ML. On the other hand, if your background were stats, I would expect less of the computational nitty gritties. Nothing wrong with that; the two fields focus on different but important aspects.
If you really care, I would encourage you to track the flow of code from csr creation in scipy.sparse using memory mapped arrays of indptr, indices and nnz to the C code that will get invoked on two such objects, carefully. The key word here is carefully. There is no nonstandard compilation because there is no compilation. It's about dispatch to the correct C function.
You seem to believe that on a 64-bit platform such an indexing error will not happen. That's patently false, because it has happened many times.
In other words, you are saying your ill-conceived and incompletely considered notions of correctness are more correct than test cases that fail.
This is exactly where a static type system would have helped. That ill-conceived, incomplete understanding would have been replaced by a proof that the proper types are maintained over all possible flows of control. In this case it would have saved me a lot of time tracking down cases where int64 drops down to int32.
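As a rough illustration of the idea, and only that, here is how the invariant could be expressed with Python's optional type hints. Int64Indices, Int32Indices, narrow and multiply_kernel_64 are hypothetical names, not anything from numpy or scipy, and NewType is enforced only by a static checker such as mypy, never at runtime.

```python
from typing import List, NewType

# Hypothetical distinct types for index buffers of different widths.
Int64Indices = NewType("Int64Indices", List[int])
Int32Indices = NewType("Int32Indices", List[int])

INT32_MAX = 2**31 - 1


def narrow(indices: Int64Indices) -> Int32Indices:
    """The one explicit, checked route from 64-bit to 32-bit indices."""
    if any(i < 0 or i > INT32_MAX for i in indices):
        raise OverflowError("index does not fit in int32")
    return Int32Indices(list(indices))


def multiply_kernel_64(indices: Int64Indices) -> int:
    """Stand-in for a kernel that requires 64-bit indices."""
    return len(indices)


wide = Int64Indices([0, 2**40])
count = multiply_kernel_64(wide)  # fine: the argument is Int64Indices

# A static checker would reject the call below before anything runs,
# because Int32Indices is not Int64Indices:
#     multiply_kernel_64(narrow(Int64Indices([1, 2])))
```

The point is that the narrowing can only happen through one named function, so every place where the width changes is visible in the signatures rather than discovered by a failing test.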
At this point I will stop engaging in this conversation because it has become an exercise in pointless dogma.
If you refuse to accept that runtime errors, detected or undetected, have a cost, or that static types can mitigate such costs, then whatever rocks your boat. What I am claiming is that several times in my Python/Cython use I hit instances where static types would have saved a lot of trouble, time and money.
Another common type-related problem happens when you need to ensure things remain float32 and do not get promoted to float64. I work both in the large and in the small, so I encounter these.
[0] https://news.ycombinator.com/item?id=17789837
[1] https://news.ycombinator.com/item?id=17789837