by mattkrause 3210 days ago
You're violently agreeing with each other.

Python itself can be pretty slow. Doing image processing on data stored as list-of-lists-of-integers would be brutally slow.

On the other hand, numpy is an import away, and it can be quite fast, especially if it's been built with an optimized BLAS/ATLAS, etc.
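To make the contrast concrete, here is a minimal sketch of the same image operation written both ways. The function names and the tiny test image are made up for illustration; the only claim is that both versions compute the same result, not how fast either runs on a given machine.

```python
import numpy as np

def brighten_lists(image, delta):
    """Pure-Python version: image is a list of lists of ints in 0..255."""
    return [[min(255, px + delta) for px in row] for row in image]

def brighten_numpy(image, delta):
    """NumPy version: one vectorized expression over a uint8 array."""
    # Widen to int16 first so the addition cannot wrap around at 255.
    widened = image.astype(np.int16) + delta
    return np.minimum(widened, 255).astype(np.uint8)

rows = [[0, 100, 250], [255, 10, 200]]
arr = np.array(rows, dtype=np.uint8)

# Both approaches agree; the NumPy one does its per-pixel work in compiled code.
assert brighten_numpy(arr, 10).tolist() == brighten_lists(rows, 10)
```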

By blazingly fast you mean 100x slower than the C++ equivalent, and only 20x slower if you're very careful to avoid accidental copies.

For reference, MATLAB is about 30x slower with no special care. Pure Java on HotSpot was 5x slower, except that it chokes on big inputs due to very slow GC and drops to 50x slower.

Source: I handled gigabytes of audio data from an HDF5 database. The C++ equivalent had no vectorization, magic BLAS, or anything like that.

As I often say in reply to comments like this: then you're doing things wrong. Numpy code can be written to never leave the numpy sandbox, and at that point it should be as fast as or faster than naive C++ (because you'll be getting SSE and stuff for free).
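A rough sketch of what "leaving the sandbox" looks like in practice, with hypothetical function names: the first version pulls every element back into the Python interpreter, the second keeps the whole computation inside numpy calls. Both return the same number; how big the speed gap is depends on array size and the numpy build, so no timings are claimed here.

```python
import numpy as np

def rms_loop(x):
    """Iterates in Python: each element is boxed as a numpy scalar."""
    total = 0.0
    for v in x:
        total += v * v
    return (total / len(x)) ** 0.5

def rms_vectorized(x):
    """Stays inside numpy: multiply, mean and sqrt all run in compiled code."""
    return float(np.sqrt(np.mean(x * x)))

x = np.linspace(-1.0, 1.0, 1001)
assert np.isclose(rms_loop(x), rms_vectorized(x))
```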

There's a reason almost all deep learning is done in Python.

The first time you have to return into Python, all the gains evaporate. As I said, we improved the Python code a few times and it never even got near. What it did was a bunch of convolutions, dot products, multiplications, FFTs, and gamma probability equations. Even vector comparisons used numpy. It used the direct HDF5 interface numpy has and the fast view interface for overlaps.
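For anyone unfamiliar with the "views for overlaps" trick, a minimal sketch follows. It assumes numpy 1.20 or newer for `sliding_window_view`; the signal, frame length and hop size are invented for the example and are not the original code's values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.arange(10, dtype=np.float64)
frame_len, hop = 4, 2

# Every window of length frame_len, then keep every hop-th one.
# This is a read-only view: the overlapping frames share the signal's memory.
frames = sliding_window_view(signal, frame_len)[::hop]
assert frames.shape == (4, 4)
assert np.shares_memory(frames, signal)

# Per-frame energy, computed without a Python-level loop over frames.
# Note that frames ** 2 does allocate a temporary of the full framed size.
energy = (frames ** 2).sum(axis=1)
assert energy[1] == 54.0  # frame [2, 3, 4, 5]
```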

Numba choked on the code by the way and crashed unless views were removed. And then it produced much worse performance.

Not all data is a good fit for Numpy: some data is non-numeric or not a homogeneous array.

> There's a reason almost all deep learning is done in python.

The heavy lifting in e.g. TensorFlow is done in C++. Bindings to Python make sense because it is one of the few sanctioned languages inside Google, and it is widely used outside of Google and easy to pick up.

>The heavy lifting in e.g. TensorFlow is done in C++. Bindings to Python make sense because it is one of the few sanctioned languages inside Google, and it is widely used outside of Google and easy to pick up.

That's exactly the same as with numpy. I'm not sure what your point is. C++ is also one of the few sanctioned languages inside Google, as is Java.

>Not all data is a good fit for Numpy: some data is non-numeric or not a homogeneous array.

I'm curious what kind of data you're working with that can't be represented and effectively transformed in a tensor (numpy array).

> That's exactly the same as with numpy. I'm not sure what your point is.

I was replying to "there's a reason why...". You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

> I'm curious what kind of data you're working with that can't be represented and effectively transformed in a tensor (numpy array).

I'm not intimately familiar with the internals of numpy, but my understanding is that the basic data structure is a (multi-dimensional) array of values (not pointers). That leads to a number of questions.

If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

How do you deal with union types, e.g. each record can be one of x types, do you make a record that has a field for each of the fields of those x types? Do those fields take up space?

>You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

TensorFlow tensors are numpy arrays, or are transparently viewable as such.

>If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

Yes, although you can also store numpy arrays of pyobjects, which are arrays of pointers. You'll be able to vectorize the code, but you won't get the same performance improvements as with a normal numpy array, because that same level of performance isn't possible with an array of pointers.
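A small sketch of both options, with made-up field names. The fixed-width field reserves room for the longest allowed string in every record and silently truncates anything longer; the object array stores pointers to ordinary Python strings instead.

```python
import numpy as np

# Fixed-width record: 'U5' reserves 5 UCS-4 characters (20 bytes) per element.
rec = np.dtype([("name", "U5"), ("count", np.int32)])
a = np.array([("ab", 1), ("abcdefgh", 2)], dtype=rec)

assert rec.itemsize == 24          # 20 bytes for the string + 4 for the int
assert a["name"][1] == "abcde"     # the longer value was silently truncated

# Object array: each element is a pointer to a Python str, so any length fits,
# but operations on it fall back to per-object Python calls.
obj = np.array(["ab", "abcdefgh"], dtype=object)
lengths = np.array([len(s) for s in obj])

assert obj[1] == "abcdefgh"
assert lengths.tolist() == [2, 8]
```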

Note that for most machine learning applications, you'd preprocess your string into a vector of some kind.

>How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

Yes, but I'm not sure when you'd do that. That is, again in most machine learning applications you'd be representing things as one-hot arrays or as some kind of compressed high dimensional position vector, where 0 would represent a lack of presence of some thing.
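For the "separate boolean" approach, a minimal sketch with invented data: the presence flags live in a parallel boolean array, and `np.ma.masked_array` is one existing numpy way to bundle the two so that reductions skip the missing entries.

```python
import numpy as np

values = np.array([3, 0, 7, 0], dtype=np.int64)
present = np.array([True, False, True, True])   # False marks a "null" slot

# Plain boolean indexing: the real 0 at index 3 is kept, the null at index 1 is not.
mean_present = values[present].mean()

# Same idea via a masked array; note the mask is True where data is MISSING.
masked = np.ma.masked_array(values, mask=~present)

assert np.isclose(mean_present, 10 / 3)
assert np.isclose(masked.mean(), 10 / 3)
assert masked.filled(-1).tolist() == [3, -1, 7, 0]
```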

>How do you deal with union types

dt = np.dtype((np.int32, {'real': (np.int16, 0), 'imag': (np.int16, 2)}))

is a 32-bit int whose two 16-bit halves can also be accessed as the fields 'real' and 'imag' (at byte offsets 0 and 2).
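The same "one block of bytes, two interpretations" idea can also be sketched with a plain `view`, which sidesteps the special dtype constructor. The field names `lo` and `hi` and the sample values are made up, and the byte order is pinned to little-endian so the result does not depend on the machine. This reinterprets storage; it is not a tagged union, so you would still need your own field to say which interpretation is valid for a given record.

```python
import numpy as np

# Two 32-bit little-endian integers.
packed = np.array([0x00020001, 0x7FFF8000], dtype="<i4")

# Reinterpret each 4-byte element as two 16-bit fields; no data is copied.
halves = packed.view(np.dtype([("lo", "<i2"), ("hi", "<i2")]))

assert halves["lo"].tolist() == [1, -32768]
assert halves["hi"].tolist() == [2, 32767]

# Writes through the view change the original integers.
halves["hi"][0] = 5
assert packed[0] == 0x00050001
```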