Hacker News
by srean 2860 days ago
As a longtime Python/Cython user I can say that I have sorely missed static types on many occasions, especially for long-running tasks. In fact I would sometimes use Cython not for performance but as a type checker.

I can describe a recent example. I had to ensure that an integer is always an int64 as the logic passes through different Python modules and libraries. It was absolute hell to track down all the places where things were dropping down to int32. With static types this would have been a no-brainer. This is not to say that I do not enjoy its dynamic typing where it is appropriate.

Hopefully Python 3 will make things better with optional types. But it's still not statically typed, just a pass through a powerful linter.
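To illustrate the "linter, not enforcement" point, here is a minimal sketch. The function name `double` is made up for the example. The annotations are visible to a static checker such as mypy, but the interpreter itself does not enforce them.

```python
def double(n: int) -> int:
    """Annotated as int -> int; nothing checks this at runtime."""
    return n + n


# A static checker would be expected to flag this call (str passed where an
# int is annotated), but the interpreter runs it without complaint.
result = double("ab")
print(result)

# The annotations are just metadata stored on the function object.
print(double.__annotations__)
```

So the checking only happens if the checker is actually run over every code path that matters.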


While I certainly concede that dynamic typing has pain points like this, I just think that on balance it creates far fewer problems than the maintenance burden and inflexibility of type system enforcement patterns.

That said, I find your particular example with int64 extremely hard to believe. I assume you’re using numpy or ctypes to get a fixed precision integer, in which case it should be extremely easy to guarantee no precision changes, and e.g. almost all operations between np.int64 and np.int32 or a Python infinite precision int will preserve the most restrictive type (highest fixed precision) in the operation.
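A small sketch of the promotion behaviour being described, using only standard NumPy calls. The array names are made up for the example. It also shows an explicit cast, which is one place where the width can drop without any error.

```python
import numpy as np

a32 = np.array([1, 2, 3], dtype=np.int32)
a64 = np.array([1, 2, 3], dtype=np.int64)

# Mixing int32 and int64 arrays promotes to the wider type.
mixed = a32 + a64
print(mixed.dtype)

# The same rule, queried directly.
promoted = np.result_type(np.int32, np.int64)
print(promoted)

# A small Python int does not widen an int32 array.
with_py_int = a32 + 1
print(with_py_int.dtype)

# An explicit cast narrows without complaint.
narrowed = a64.astype(np.int32)
print(narrowed.dtype)
```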

I work in numerical linear algebra and data analytics and have used Python and Cython for years, often caring about precision issues, and have literally never encountered a situation where it was hard to verify what happens with precision.

Unless you’re using some non-numpy custom int64 type that has bizarre lossy semantics, it is quite hard to trigger loss of precision. And even then, a solution using numpy precision-maintaining conventions will be better and easier than some heavy type enforcement.

I will agree about the 'on the balance' in the context of speed of prototyping and interactive sessions.

When the rubber is about to hit the road, i.e. near deployment with money at stake, I would have loved an option to freeze the types, at least in many places. Cython comes in handy, but it's clunky and its syntax and semantics are not super obvious to a beginner (I am no longer one, but I remember my days of confusion regarding cimporting std headers, Python headers, how to use Python arrays (not numpy arrays), etc.).

I am curious: have you put money at stake supported only by dynamic types?

Regarding int32 vs int64, it's not a precision issue; it's about sparse matrices with more than 1<<31 nonzeros. I am equally surprised that you have not run into this given your practical experience with matrices.
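For concreteness, a minimal sketch of why 1<<31 is the breaking point, using only the standard library. The variable names are made up for the example. ctypes does no overflow checking, so the value wraps the way it would in C code that indexes with a signed 32-bit integer.

```python
import ctypes

# Largest value a signed 32-bit index can hold.
INT32_MAX = (1 << 31) - 1

# One more nonzero than a signed 32-bit index can address.
nnz = 1 << 31

# Stored in a 32-bit slot the count wraps to a negative number.
wrapped = ctypes.c_int32(nnz).value

# Stored in a 64-bit slot it survives intact.
as_int64 = ctypes.c_int64(nnz).value

print(INT32_MAX, wrapped, as_int64)
```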

My case involves more than just numpy. There's hdf5, scipy.sparse, some memory mapped arrays and of course numpy.

Given the amount of time I spent debugging this, I would have killed for static type checks.

I happen to use scipy sparse csc and csr matrices for huge sparse tfidf data at work, but never encountered this (we have a numba utility function for operations we do directly on the data, indices, and indptr internal arrays, including counting).

But I do see that counting nnz boils down to a call to np.count_nonzero, which treats bools as np.intp, which is either going to be int32 or int64 (very weird that it chooses signed types), then calls np.sum.

The best solution would be to use np.seterr to warn exactly at call sites with int32 overflow, but amazingly, there seems to be an open numpy issue saying that seterr is not guaranteed for sum.

I do think seterr + logging would be better for this than roping in static typing everywhere just to get a one off benefit like this.
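A rough sketch of the runtime-check approach, with one caveat: as far as I know, elementwise integer arithmetic on NumPy arrays wraps silently regardless of the error state, so the sketch falls back on an explicit check plus logging. The helper `checked_total` and the logger name are hypothetical, not part of NumPy.

```python
import logging

import numpy as np

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("overflow_check")  # hypothetical logger name

INT32_MAX = int(np.iinfo(np.int32).max)

counts = np.array([INT32_MAX, 1], dtype=np.int32)

# Integer array arithmetic wraps around here even with over="raise".
with np.errstate(over="raise"):
    wrapped = counts[:1] + counts[1:]
print(wrapped)


def checked_total(values):
    """Hypothetical helper: sum in int64 and log if the total exceeds int32."""
    total = int(values.sum(dtype=np.int64))
    if total > INT32_MAX:
        log.warning("total %d does not fit in int32", total)
    return total


total = checked_total(counts)
print(total)
```

A check like this only protects the call sites where it is actually used, which is the crux of the disagreement in this thread.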

But that's just NumPy. As I mentioned, the logic flows through other components too. I am guessing your nnzs are medium sized and haven't hit 2 billion yet.

Quick question: when you create a scipy.sparse csr matrix, how do you ensure the subsequent multiplication operator falls back to C code that uses int64 to index the internals and not int32? I thought that if the indices array was an int64 array it would do the job. I was wrong. Anyway, even if that had worked it would still have fallen short of a guarantee. If it worked, it just happened to work -- that's an anecdote.
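One way to at least observe what happens is to inspect the index dtypes after construction and after a product. This is only a sketch with a tiny made-up matrix. The expected dtypes are deliberately left out, since scipy may pick the index dtype itself and that can vary by version and by the contents of the arrays.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny example matrix whose index arrays are explicitly int64.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1], dtype=np.int64)
indptr = np.array([0, 2, 3], dtype=np.int64)

m = csr_matrix((data, indices, indptr), shape=(2, 3))

# What the matrix actually stores, which need not match what was passed in.
print(m.indices.dtype, m.indptr.dtype)

# Sparse-on-sparse product; inspect the index dtypes of the result too.
product = m @ m.T
print(product.indices.dtype, product.indptr.dtype)
```

This is observation, not enforcement: it shows what one run did with one input, which is the "anecdote" problem described above.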

If one had static type checks one would not have to read through all the layers to be sure. A compile error, if any, would have told me.

We also can't directly use scipy.sparse because we don't have that much RAM on these machines. We do use scipy.sparse, but the matrices operate internally with memory mapped arrays. Now, depending on the platform, memory mapped arrays can be limited to an index of 1<<31. So we have to be extra careful about what type is used for indexing in the native libraries that these layers are wrappers over.

BTW it's far from a one-off benefit. This was just one of the examples fresh in my memory. It directly affects real money. There you don't want to ship code that could have bugs that can cost you. Static types help rule out these cases once and for all. With runtime checks it is very hard to be sure that you have caught all of the code paths that can have these mismatches.

I agree that in grad school it's different :) One can play fast and loose. Even more, if research is not expected to be reproducible -- that would be pure science.

Our nnz is certainly far greater than 2 billion. The matrix size is around 150 million rows by around 1.7 million columns. We just accumulate the count with a python integer.

I don’t know what you mean by “that’s just numpy” though — since even if this flows through other systems, tracking it at the source in numpy would be obvious.

“Static types help rule out these cases..” — I just disagree. That is what’s advertised, but it’s just not true. Years of working in Scala for very heavy enterprise production systems has made me realize it’s a very false promise. There are actually remarkably few classes of these errors that are removed by static type enforcement, and perfectly good patterns to deal with it in dynamic type situations.

If static typing was free, then sure, why not. But instead it’s hugely costly and kills a lot of productivity, rather than the promise that it improves productivity over time by accumulating compounding type safety benefits.

I think a good rule of thumb is that anything that causes you to need to write more code will be worse in the long run. There’s no guarantee you’ll actually face fewer future bugs with static typing and visibility noise in the code, but you can guarantee it adds more to your maintenance costs, compile times, and complexity of refactoring.

I guess Python’s gradual typing is a good compromise, since you don’t have to choose between zero type safety or speculative all-in type safety where the maintenance overhead almost always outweighs the benefits (rendering it a huge and irreconcilable form of premature optimization).

You can only add it in those few, rare places where there is demonstrated evidence that the static typing optimization actually has a payoff.

> since even if this flows through other systems, tracking it at the source in numpy would be obvious.

You can't possibly be saying that! Even if one assumes that the source is numpy.

Regarding the rest, let's say my experience with OCaml has been more gratifying than yours with Scala.

> We just accumulate the count with a python integer.

That won't help when you are using scipy.sparse for sparse-on-sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure Python.

Our differences in tastes aside, you seem to work on interesting stuff. Would love exchanging notes in case we run into each other one day. Should be fun.