
by joshuamorton 3210 days ago

>The heavy-lifting in e.g. TensorFlow is done in C++. Bindings to Python make sense because it is one of the few sanctioned languages inside Google, and it is widely used outside of Google and easy to pick up.

That's exactly the same as with numpy. I'm not sure what your point is. C++ is also one of the few sanctioned languages inside Google, as is Java.

>Not all data is a good fit for Numpy: some data is non-numeric or not a homogenous array.

I'm curious what kind of data you're working with that can't be represented and effectively transformed in a tensor (numpy array).


pg314 3210 days ago

> That's exactly the same as with numpy. I'm not sure what your point is.

I was replying to "there's a reason why...". You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

> I'm curious what kind of data you're working with that can't be represented and effectively transformed in a tensor (numpy array).

I'm not intimately familiar with the internals of numpy, but my understanding is that the basic data structure is a (multi-dimensional) array of values (not pointers). That leads to a number of questions.

If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

How do you deal with union types, e.g. each record can be one of x types, do you make a record that has a field for each of the fields of those x types? Do those fields take up space?

joshuamorton 3208 days ago

>You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

TensorFlow tensors are numpy arrays, or are transparently viewable as such.

>If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

Yes, although you can also store numpy arrays of Python objects (dtype=object), which are arrays of pointers. You can still write the code in a vectorized style, but you won't get the same speedup as with a normal numpy array, because every operation on an array of pointers has to dereference each element and call back into Python.
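To make the trade-off concrete, here is a minimal sketch of the two storage options. The variable names and the field names `id` and `name` are made up for illustration; the point is only that fixed-width string dtypes pad (or truncate) to a set width, while an object array holds pointers to ordinary Python strings.

```python
import numpy as np

# Fixed-width storage: every element is sized for the longest string.
fixed = np.array(["a", "abcdef"])
print(fixed.dtype, fixed.itemsize)

# A record array with a 4-byte string field: longer values are
# silently truncated to fit the declared width.
records = np.zeros(2, dtype=[("id", np.int32), ("name", "S4")])
records["name"] = [b"ab", b"abcdefgh"]
print(records["name"])

# Object array: each element is a pointer to a Python str of any length.
boxed = np.array(["a", "abcdef" * 100], dtype=object)

# np.vectorize gives array-style syntax, but it still calls len()
# once per element in Python, so it is a convenience, not a speedup.
lengths = np.vectorize(len)(boxed)
print(lengths)
```

If the maximum length is not known beforehand, the object array is the usual fallback, at the cost described above.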

Note that for most machine learning applications, you'd preprocess your string into a vector of some kind.

>How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

Yes, but I'm not sure when you'd do that. Again, in most machine learning applications you'd represent things as one-hot arrays or as some kind of compressed high-dimensional position vector, where 0 represents the absence of some thing.
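For cases where a real 0 has to be told apart from a missing value, the "separate boolean" approach might look like the sketch below. The field names `value` and `is_null` are hypothetical; `numpy.ma` is numpy's own masked-array module, which packages the same idea of data plus a boolean mask.

```python
import numpy as np

values = np.array([3, 0, 7, 0], dtype=np.int32)
# Only the second entry is missing; the last entry is a genuine 0.
is_null = np.array([False, True, False, False])

# Option 1: keep an explicit flag next to the value in each record.
records = np.zeros(4, dtype=[("value", np.int32), ("is_null", np.bool_)])
records["value"] = values
records["is_null"] = is_null

# Option 2: a masked array, which skips masked entries in reductions.
masked = np.ma.masked_array(values, mask=is_null)
mean_present = masked.mean()  # averages 3, 7 and the genuine 0
print(masked.count(), mean_present)
```

The flag costs one extra byte per record in the packed layout used here.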

>How do you deal with union types

dt = np.dtype((np.int32, {'real': (np.int16, 0), 'imag': (np.int16, 2)}))

is a 32-bit int whose two 16-bit halves can also be accessed as integer fields named real and imag.
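The same overlapping-bytes trick can be pushed toward a C-style tagged union. The sketch below is an assumption about how one might lay that out, not something from the comment above: it uses the dict form of `np.dtype` with explicit offsets, and the names `tag`, `as_int` and `as_float` are invented for the example.

```python
import numpy as np

# Hypothetical tagged union: a 1-byte tag, then 4 bytes that are read
# either as an int32 or as a float32 depending on the tag.
union_dt = np.dtype({
    "names": ["tag", "as_int", "as_float"],
    "formats": [np.uint8, np.int32, np.float32],
    "offsets": [0, 4, 4],  # as_int and as_float share the same bytes
    "itemsize": 8,
})

records = np.zeros(2, dtype=union_dt)
records["tag"] = [0, 1]          # 0 means int payload, 1 means float payload
records["as_int"][0] = 42
records["as_float"][1] = 1.5

# Pick the right view of the payload based on the tag.
ints = records["as_int"][records["tag"] == 0]
floats = records["as_float"][records["tag"] == 1]
print(ints, floats)
```

Every record takes the size of the largest variant plus the tag, and writing one field overwrites the bytes of the field it overlaps, so the tag is the only thing that says which view is meaningful.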