by mlthoughts2018 2862 days ago
Our nnz is certainly far greater than 2 billion. The matrix size is around 150 million rows by around 1.7 million columns. We just accumulate the count with a python integer.
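To make the scale concrete, here is a minimal sketch of why a count like this needs more than a signed 32-bit integer. The chunk counts and the helper names `wrap_int32` and `total_nnz` are hypothetical, chosen only for illustration; they do not come from any library.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def wrap_int32(value):
    """Simulate what a signed 32-bit counter would hold after wraparound."""
    return (value + 2**31) % 2**32 - 2**31


def total_nnz(per_chunk_counts):
    """Accumulate nonzero counts in a Python int, which never overflows."""
    total = 0
    for count in per_chunk_counts:
        total += count
    return total


# Hypothetical per-chunk nonzero counts, not real data.
chunk_counts = [1_500_000_000, 1_500_000_000]
nnz = total_nnz(chunk_counts)

print(nnz > INT32_MAX)  # True: the count does not fit in int32
print(wrap_int32(nnz))  # the negative value a 32-bit counter would hold
```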

I don’t know what you mean by “that’s just numpy” though — since even if this flows through other systems, tracking it at the source in numpy would be obvious.

“Static types help rule out these cases...” I just disagree. That is what’s advertised, but it’s just not true. Years of working in Scala on very heavy enterprise production systems have made me realize it’s a very false promise. There are actually remarkably few classes of these errors that are removed by static type enforcement, and there are perfectly good patterns for dealing with them in dynamically typed code.

If static typing were free, then sure, why not. But instead it’s hugely costly and kills a lot of productivity, rather than delivering on the promise that it improves productivity over time through compounding type safety benefits.

I think a good rule of thumb is that anything that causes you to need to write more code will be worse in the long run. There’s no guarantee you’ll actually face fewer future bugs with static typing and visibility noise in the code, but you can guarantee it adds more to your maintenance costs, compile times, and complexity of refactoring.

I guess Python’s gradual typing is a good compromise, since you don’t have to choose between zero type safety and speculative all-in type safety where the maintenance overhead almost always outweighs the benefits (rendering it a huge and irreconcilable form of premature optimization).

You can add it only in those few, rare places where there is demonstrated evidence that the static typing optimization actually has a payoff.


> since even if this flows through other systems, tracking it at the source in numpy would be obvious.

You can't possibly be saying that, even if one assumes that the source is numpy.

Regarding the rest, let's say my experience with OCaml has been more gratifying than yours with Scala.

> We just accumulate the count with a python integer.

That won't help when you are using scipy.sparse for sparse on sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.
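A rough sketch of the decision that has to come out right before any C routine is chosen: whether the index arrays can safely be 32-bit. The helper `required_index_bits` is hypothetical and only models the arithmetic; it is not scipy's dispatch logic, and the nnz values are illustrative assumptions.

```python
INT32_MAX = 2**31 - 1


def required_index_bits(n_cols, nnz):
    """Return 32 or 64 for a CSR matrix's index arrays.

    In CSR, `indices` holds column numbers (at most n_cols - 1) and
    `indptr` holds running offsets into `indices` (at most nnz).
    """
    largest = max(n_cols - 1, nnz)
    return 32 if largest <= INT32_MAX else 64


# Column count from the thread; the nnz values are assumptions.
print(required_index_bits(1_700_000, 3_000_000_000))  # 64
print(required_index_bits(1_700_000, 1_000_000_000))  # 32
```

Note that the answer depends on nnz, not on the matrix shape alone: the same 1.7 million columns need either width depending on how many entries are stored.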

Our differences in tastes aside, you seem to work on interesting stuff. Would love exchanging notes in case we run into each other one day. Should be fun.

> “You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.”

For csc and csr matrices at least, these operations typically iterate the underlying indices, indptr and data arrays, and csc `nonzero` uses len(indices). That length both relies (eventually) on the C-level call to malloc that defined `indices` (so it uses the system’s address space precision, and would never report the number of elements in a lower precision int than what the platform supports for memory addressing) and is returned as an arbitrary precision Python int. Afterwards it only uses arrays of indices, not integers holding sizes.

Long story short is that at least for csc matrices, the issue you describe wouldn’t be possible internally to scipy’s C operations, as you’d always be dealing with an integer type large enough for any possible contiguous array length that can be requested on that platform (and the nonzero items are stored in contiguous arrays under the hood).

On my team we are not doing pure Python ops on the sparse matrices, rather we needed customized weighted operations (for a sparse search engine representation that weights bigrams, trigrams, trending elements, etc., in customized ways) and some set operations to filter rows out of sparse matrices.

So we basically rip the internal representation (data, indices, and indptr) out of csc matrices and pass them into a toolkit of numba functions that we have spent time optimizing.
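For readers unfamiliar with the layout, here is a minimal pure-Python sketch of a weighted operation over the raw CSC arrays. The function `weighted_column_sums` and the weights are hypothetical stand-ins; the actual toolkit is not shown in this thread, and a real version would presumably be compiled with numba rather than run as plain Python.

```python
def weighted_column_sums(data, indices, indptr, row_weights):
    """Sum each column's stored values, scaling each by its row's weight.

    CSC layout: the entries of column j live in data[indptr[j]:indptr[j + 1]],
    and indices[k] is the row number of data[k].
    """
    n_cols = len(indptr) - 1
    out = [0.0] * n_cols
    for j in range(n_cols):
        for k in range(indptr[j], indptr[j + 1]):
            out[j] += data[k] * row_weights[indices[k]]
    return out


# CSC arrays for the 3x3 matrix [[1, 0, 2], [0, 0, 3], [4, 5, 6]].
data = [1, 4, 5, 2, 3, 6]
indices = [0, 2, 2, 0, 1, 2]
indptr = [0, 2, 3, 6]
row_weights = [1.0, 2.0, 0.5]  # hypothetical per-row weights

print(weighted_column_sums(data, indices, indptr, row_weights))  # [3.0, 2.5, 11.0]
```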

Let's not weasel with 'typically'.

The code that will get called for a multiply is this https://github.com/scipy/scipy/blob/master/scipy/sparse/spar... and https://github.com/scipy/scipy/blob/master/scipy/sparse/spar...

It's important that decisions at the Python level trickle down to the correct choice when it comes down to this level.

On a 64-bit architecture one would expect that using 64-bit int arrays for indices and indptr would ensure that. But that's not the way it works. We regularly encountered cases where it would call the code corresponding to int32. I know why, and I have special checks and jump through hoops to prevent this.
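A check of the kind mentioned here could look roughly like the following. This is a hedged sketch, not the commenter's code: `require_int64` is a hypothetical helper, and the `astype(np.int32)` call merely stands in for whatever code path narrows the arrays.

```python
import numpy as np


def require_int64(name, arr):
    """Fail loudly if an index array is not int64."""
    if arr.dtype != np.int64:
        raise TypeError(f"{name} has dtype {arr.dtype}, expected int64")
    return arr


# indptr whose last entry (the nnz) exceeds the int32 range.
indptr = np.array([0, 1_000_000_000, 3_000_000_000], dtype=np.int64)

# Stand-in for a code path that narrows the array without raising.
narrowed = indptr.astype(np.int32)

require_int64("indptr", indptr)  # passes
try:
    require_int64("indptr", narrowed)  # caught here, before native code runs
except TypeError as exc:
    print(exc)
```

The guard only reports that the dtype changed; finding which call in the chain changed it is still manual work.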

That's beside the point. With static types I wouldn't need to do this; the compiler would take care of it.

I appreciate your effort to dig through the logic. You have spent time speaking at length in the comment above but unfortunately said little. Malloc has nothing to do with it. Your third paragraph is manifestly false. Why do I say so? Because I deal with this every day and have counterexamples.

I didn't mean to ask you to find out. Apologies if I wasted your time. I already know why the type mismatch happens. My point was to demonstrate that a lot of manual wading is needed to ensure that it finally bottoms out by calling native code with the correct type.

The code you linked actually seems to refute your claim of this precision error, at least for multiply, because it is using npy_intp for nnz, which will be int64 on a 64 bit platform, and there is even an overflow check below!
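The kind of overflow check being referred to can be sketched as below. This illustrates the pattern only, in Python; `checked_add` is a hypothetical helper and not the scipy source. `sys.maxsize` is used as a stand-in for the largest pointer-sized signed integer on the running platform.

```python
import sys

INTP_MAX = sys.maxsize  # stand-in for the pointer-sized signed maximum
INT32_MAX = 2**31 - 1


def checked_add(total, increment, limit=INTP_MAX):
    """Add two non-negative counts, raising instead of exceeding `limit`."""
    if increment > limit - total:
        raise OverflowError("nnz of the result is too large")
    return total + increment


# With a 64-bit limit, billions of entries fit.
print(checked_add(3_000_000_000, 1_000_000_000, limit=2**63 - 1))  # 4000000000

# With a 32-bit limit the same kind of addition is rejected, not wrapped.
try:
    checked_add(2_000_000_000, 1_000_000_000, limit=INT32_MAX)
except OverflowError as exc:
    print(exc)
```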

Can you post a gist or link some other concrete example to show how it can overflow the intp type based on large nnz? Reading the code, it looks like this could not happen.

(Note that the entire second step function wouldn’t have this problem, because it’s accessing indices inside the other arrays, after nnz has already been computed, and is not looping over a variable that would overflow, apart from nnz from the first function, which I pointed out above seems not to overflow unless you’re compiling things in a non-standard way that affects npy_intp).

I don’t know what you mean by saying malloc has nothing to do with it, though. That is how numpy arrays possess their post-allocation result for __len__, such as for indices, indptr and data in csr. So __len__ could not overflow an int type (since it requires the platform address space’s int type to allocate underlying contiguous arrays and returns a Python integer).

Can't help but say this, you are seriously confused. Not necessarily your fault, as obviously you don't have the full code.

I have mentioned earlier that it is not about precision but about index space. I don't think it's going to be a productive use of my time to continue this thread.

One specific reason you are getting confused is that you are looking at function calls in isolation, not at the entire chain of calls through the different Python ecosystem libraries. The problem is that the indptr and indices arrays that begin their life as int64 arrays get transformed into int32 arrays in specific code paths.

By stating that malloc is not relevant, I mean it's not relevant in this particular instance. By the time control reaches malloc, the type mismatch damage has already taken place.

Getting a runtime error is far from the end of the matter, even if in certain cases we do get runtime errors. What static types save the user from is the hunting needed to find out where in the chain of functions we are losing the type invariant we need.

Stopping such bugs is a no-brainer with static types. You claimed at one point upstream [0] that type systems cannot rule out such errors. If you believe that, this discussion is a waste of time. That's one of the lowest forms of errors a type system prevents. Comments like these make me doubt your grasp of these things.

BTW, not sure if you believe large rows and cols imply large nnz [1]. That's not how sparse works.

Given your handle, I would have expected you to be familiar with this; it is bread-and-butter stuff in day-to-day ML. On the other hand, if your background were stats, I would expect less of the computational nitty-gritty. Nothing wrong with that; they focus on different but important aspects.

If you really care, I would encourage you to track the flow of code from csr creation in scipy.sparse using memory mapped arrays of indptr, indices and nnz to the C code that will get invoked on two such objects, carefully. The key word here is carefully. There is no nonstandard compilation because there is no compilation. It's about dispatch to the correct C function.

You seem to believe that on a 64-bit platform such an indexing error will not happen. That's patently false because it happened many times.

In other words, you are saying your ill-conceived and incompletely considered notion of correctness is more correct than test cases that fail.

This is exactly where a static type system would have helped. That ill-conceived, incomplete understanding would have been replaced by a proof that proper types are maintained over all possible flows of control. In this case it would have saved me a lot of time tracking cases where int64 is dropping down to int32.
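As a rough illustration of the argument, index width can be encoded in distinct types so that a static checker rejects a mix-up before anything runs. This sketch uses Python type hints; `Idx32`, `Idx64`, `widen`, and `kernel_64` are hypothetical names, and it makes no claim about how scipy is or should be typed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Idx32:
    """Index values known to fit in 32 bits."""
    values: List[int]


@dataclass(frozen=True)
class Idx64:
    """Index values that may need 64 bits."""
    values: List[int]


def widen(idx: Idx32) -> Idx64:
    """Explicit, visible conversion from 32-bit to 64-bit indices."""
    return Idx64(list(idx.values))


def kernel_64(indptr: Idx64) -> int:
    """Stand-in for a 64-bit native routine; returns nnz."""
    return indptr.values[-1]


small = Idx32([0, 2, 3, 6])
print(kernel_64(widen(small)))  # 6

# kernel_64(small) would be flagged by a static checker such as mypy,
# because Idx32 is not Idx64, although plain Python would still run it.
```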

At this point I would stop engaging in this conversation because it has become an exercise in pointless dogma.

If you refuse to accept that runtime errors, detected or undetected, have a cost, or that static types can mitigate such costs, whatever floats your boat. What I am claiming is that several times in my Python/Cython use I hit instances where static types would have saved a lot of trouble, time, and money.

Another common type related problem happens when you need to ensure things remain float32 and do not get promoted to float64. I work both on the large and in the small, so I encounter these.
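A small sketch of how that promotion sneaks in with NumPy arrays, and one way to guard against it. The array contents are arbitrary.

```python
import numpy as np

a32 = np.ones(3, dtype=np.float32)
b64 = np.ones(3, dtype=np.float64)
i32 = np.arange(3, dtype=np.int32)

print((a32 + a32).dtype)  # float32
print((a32 + b64).dtype)  # float64: one float64 operand promotes the result
print((a32 + i32).dtype)  # float64: int32 does not fit losslessly in float32

# Guard: cast the other operand first and assert the result stays float32.
result = a32 + i32.astype(np.float32)
assert result.dtype == np.float32
```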

[0] https://news.ycombinator.com/item?id=17789837

[1] https://news.ycombinator.com/item?id=17789837

> “I have mentioned earlier that it is not about precision but about index space.”

Right, and if you read my comments you will see I have also been talking only about the index space, where int64 vs int32 is a matter of int precision for representing large amounts of indices. npy_intp will be of the higher precision (to match the platform’s address space) and cannot suffer the overflow issue you described unless it’s a custom compilation of numpy that defines npy_intp as int32 even on a 64-bit system. You seem confused about this, repeatedly saying compilation isn’t a part of it, as if anyone were suggesting your personal workflow involves compiling anything; I’m talking about how the numpy you have installed was compiled. If it’s standard numpy compiled for a 64-bit system, then the evidence suggests your claim is just wrong.

You claim that indptr and indices arrays are silently converted from int64 to int32 on 64-bit platforms but you offer no evidence. You just keep saying that it happened to you, despite the actual code you linked indicating that it couldn’t happen. And I do actually work with indptr and indices arrays with tens of billions of nonzero elements in an Alexa top 300 website search engine every day, and have never encountered any such silent type conversion.

Given that this example of index int precision actually seems unfounded in the code, it just doesn’t seem relevant to any sort of static vs dynamic typing debate. There’s no such issue here that static typing would help with, because it’s clearly not causing the problem you think it’s causing in the dynamic typing code.