Hacker News new | ask | show | jobs
by mlthoughts2018 2867 days ago
While I certainly concede that dynamic typing will have painpoints like this, I just think on balance they create far fewer problems than the maintenance and inflexibility of type system enforcement patterns.

That said, I find your particular example with int64 extremely hard to believe. I assume you’re using numpy or ctypes to get a fixed precision integer, in which case it should be extremely easy to guarantee no precision changes, and e.g. almost all operations between np.int64 and np.int32 or a Python infinite precision int will preserve the most restrictive type (highest fixed precision) in the operation.

I work in numerical linear algebra and data analytics and have used Python and Cython for years, often caring about precision issues— and have literally never encountered a situation where it was hard to verify what happens with precision.

Unless you’re using some non-numpy custom int64 type that has bizarre lossy semantics, it is quite hard to trigger loss of precision. And even then, a solution using numpy precision-maintaining conventions will be better and easier than some heavy type enforcement.

1 comments

I will agree about the 'on the balance' in the context of speed of prototyping and interactive sessions.

When rubber is about to hit the road, i.e. near deployment with money at stake, I would have love an option to freeze the types, at least in many places. Cython comes in handy, but its clunky and its syntax and semantics is not super obvious to a beginner (I am no longer one, but I remember my days of confusion regarding cyimporting std headers, python headers, how do you use python arrays (not numpy arrays) etc etc).

I am curious, have you put money at stake supported only by dynamic types ?

Regarding int32 vs int64, its not a precision issue its about sparse matrices with more than 1<<31 nonzeros. I am equally surprised that you have not run into this given your practical experience with matrices.

My case involves more than just numpy. There's hdf5, scipy.sparse, some memory mapped arrays and of course numpy.

Given the amount of time I spent to debug this, I would have killed for static type checks.

I happen to use scipy sparse csc and csr matrices for huge sparse tfidf data at work, but never encountered this (we have a numba utility function for operations we do directly on the data, indices, and indptr internal arrays, including counting).

But I do see that counting nnz boils down to a call to np.count_nonzero, which treats bools as np.intp, which is either going to be int32 or int64 (very weird that it chooses signed types), then calls np.sum.

The best solution would be to use np.seterr to warn exactly at call sites with int32 overflow, but amazingly, there seems to be an open numpy issue saying that seterr is not guaranteed for sum.

I do think seterr + logging would be better for this than roping in static typing everywhere just to get a one off benefit like this.

But thats just Numpy. As I mentioned the logic flows through other components too. I am guessing your nnzs are medium sized and hasnt hit 2 billion yet.

Quick question, when you create a scipy.csr how do you ensure the subsequent multiplication operator falls back to C code that uses int64 to index the internals and not int32. I thought if indices array was a int64 array it would do the job. I was wrong. Anyway, even if that had worked it would still have fallen short of ensuring. If it worked, it just happened to work -- thats an anecdote.

If one had static typechecks one would not have to read through all the layers to be sure. Compile error, if any, would have told me.

We also cant directly use scipy.sparse because we dont have that much RAM on these machines. We do use scipy.sparse but they operate internally with memory mapped arrays. Now, depending on the platform memory mapped arrays can be limited to an index of 1<<31. So we have to be extra careful what type is used for indexing in the native libraries that these layers are a wrappers over.

BTW its far from a one off benefit. This was just one of the examples fresh in my memory. It directly affects real money. There you dont want to ship code that could have bugs that can cost you. Static types help rule out these cases once for all. With run time checks it is very hard to be sure that you have caught all of the code paths that can have these mismatches.

I agree that in grad school its different :) One can play fast and loose. Even more, if research is not expected to be reproducible -- that would be pure science.

Our nnz is certainly far greater than 2 billion. The matrix size is around 150 million rows by around 1.7 million columns. We just accumulate the count with a python integer.

I don’t know what you mean by “that’s just numpy” though — since even if this flows through other systems, tracking it at the source in numpy would be obvious.

“Static types help rule out these cases..” — I just disagree. That is what’s advertised, but it’s just not true. Years of working in Scala for very heavy enterprise production systems has made me realize it’s a very false promise. There are actually remarkably few classes of these errors that are removed by static type enforcement, and perfectly good patterns to deal with it in dynamic type situations.

If static typing was free, then sure, why not. But instead it’s hugely costly and kills a lot of productivity, rather than the promise that it improves productivity over time by accumulating compounding type safety benefits.

I think a good rule of thumb is that anything that causes you to need to write more code will be worse in the long run. There’s no guarantee you’ll actually face fewer future bugs with static typing and visibility noise in the code, but you can guarantee it adds more to your maintenance costs, compile times, and complexity of refactoring.

I guess Python’s gradual typing is a good compromise, since you don’t have to choose between zero type safety or speculative all-in type safety where the maintenance overhead almost always outweighs the benefits (rendering it a huge and unreconcilable form of premature optimization).

You can only add it in those few, rare places where there is demonstrated evidence that the static typing optimization actually has a payoff.

> since even if this flows through other systems, tracking it at the source in numpy would be obvious.

You cant possibly be saying that ! even if one assumes that source is numpy.

Regarding the rest, lets say my experience with Ocaml has been more gratifying than yours with Scala.

> We just accumulate the count with a python integer.

That wont help when you are using scipy.sparse for sparse on sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.

Our differences in tastes aside, you seem to work on interesting stuff. Would love exchanging notes in case we run into each other one day. Should be fun.

> “You have to ensure that it falls back to C code that uses Int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure python.”

For csc and csr matrices at least, these operations typically iterate the underlying indices, indptr and data arrays, and csc `nonzero` uses len(indices), which both relies on (eventually) the C-level call to malloc that defined `indices` (and so uses the systems address space precision, and would never report number of elements in a lower precision int than what the platform supports for memory addressing), and returns this as an infinite precision Python int. Afterwards it only uses arrays of indices, not integers holding sizes.

Long story short is that at least for csc matrices, the issue you describe wouldn’t be possible internally to scipy’s C operations, as you’d always be dealing with an integer type large enough for any possible contiguous array length that can be requested on that platform (and the nonzero items are stored in contiguous arrays under the hood).

On my team we are not doing pure Python ops on the sparse matrices, rather we needed customized weighted operations (for a sparse search engine representation that weights bigrams, trigrams, trending elements, etc., in customized ways) and some set operations to filter rows out of sparse matrices.

So we basically rip the internal representation (data, indices, and indptr) out of csc matrices and pass them into a toolkit of numba functions that we have spent time optimizing.