by pg314 3209 days ago
> That's exactly the same as with numpy. I'm not sure what your point is.

I was replying to "there's a reason why...". You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

> I'm curious what kind of data you're working with that can't be represented and effectively transformed in a tensor (numpy array).

I'm not intimately familiar with the internals of numpy, but my understanding is that the basic data structure is a (multi-dimensional) array of values (not pointers). That leads to a number of questions.

If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

How do you deal with union types, e.g. each record can be one of x types, do you make a record that has a field for each of the fields of those x types? Do those fields take up space?

1 comments

>You didn't specify that reason, so from the rest of your comment I took it to mean that Python (with numpy) was fast and good enough to write deep learning stuff. That doesn't seem to be the case for TensorFlow.

TensorFlow tensors are numpy arrays, or are transparently viewable as such.

>If you have an array of records (dtype objects), and one of the fields is a string, am I correct that each element needs to allocate memory to hold the longest possible value that can occur for that field? What if that is not known beforehand?

Yes, although you can also store numpy arrays of pyobjects, which are arrays of pointers. You'll be able to vectorize the code, but you won't get the same performance improvements as with a normal numpy array, because that same level of performance isn't possible with an array of pointers.
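To make the trade-off concrete, here is a minimal sketch of both layouts. The field names, the 5-character width and the sample values are made up for illustration; the point is only that the fixed-width field reserves the same space for every record, while the object array stores pointers to ordinary Python strings.

```python
import numpy as np

# Fixed-width layout: every record reserves room for 5 characters in 'name'
# (4 bytes per character for a 'U' field), plus 4 bytes for the int32.
fixed = np.dtype([('name', 'U5'), ('age', np.int32)])
people = np.zeros(2, dtype=fixed)
people['name'] = ['Ann', 'Alexander']   # the longer value gets cut to 5 chars
people['age'] = [31, 47]

print(fixed.itemsize)      # bytes per record, regardless of string length
print(people['name'])      # 'Alexander' is stored as 'Alexa'

# Object layout: the array holds pointers to Python str objects, so any
# length fits, but element-wise operations go through the interpreter.
names = np.array(['Ann', 'Alexander'], dtype=object)
print(names[1])
```

If the maximum length isn't known beforehand, you either pick a width and accept truncation, or fall back to the object dtype and give up the contiguous layout.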

Note that for most machine learning applications, you'd preprocess your string into a vector of some kind.

>How do you deal with optional fields (e.g. int or null)? Do you need to add a separate boolean to indicate null?

Yes, but I'm not sure when you'd do that. That is, again in most machine learning applications you'd be representing things as one-hot arrays or as some kind of compressed high dimensional position vector, where 0 would represent a lack of presence of some thing.
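If you do need a nullable int, one way to carry the "separate boolean" around is numpy's masked arrays, which pair a value array with a mask array. A rough sketch, with made-up values and an arbitrary -1 as the fill value:

```python
import numpy as np

# Values plus a parallel boolean array marking which entries are "null".
values = np.array([3, 0, 7], dtype=np.int32)
is_null = np.array([False, True, False])

# np.ma bundles the two; masked entries are skipped by reductions.
opt = np.ma.masked_array(values, mask=is_null)

total = opt.sum()          # only the unmasked entries are summed
filled = opt.filled(-1)    # plain ndarray with nulls replaced by -1
print(total, filled)
```

The mask costs one extra byte per element, so the answer to "do you need a separate boolean" is still yes; the masked array just keeps it next to the data.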

>How do you deal with union types

dt = np.dtype((np.int32,{'real':(np.int16, 0),'imag':(np.int16, 2)}))

is a 32-bit int whose two halves can also be accessed as 16-bit ints through the fields real and imag (bytes 0-1 and 2-3), a bit like a complex number with 16-bit parts.
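For something closer to a tagged union, a structured dtype can give two fields the same byte offset, so they overlap in memory. This is only a sketch: the field names (tag, i, f) and the layout are hypothetical, and it relies on numpy accepting overlapping offsets in the dict form of np.dtype, which as far as I know it does for non-object fields.

```python
import numpy as np

# Hypothetical tagged union: 'tag' says which payload field is meaningful.
# 'i' and 'f' share bytes 4..7, so they cost the space of one field, not two.
union_dt = np.dtype({
    'names':    ['tag', 'i',   'f'],
    'formats':  ['u1',  '<i4', '<f4'],
    'offsets':  [0,     4,     4],
    'itemsize': 8,
})

recs = np.zeros(2, dtype=union_dt)
recs['tag'] = [0, 1]       # 0 -> read 'i', 1 -> read 'f'
recs['i'][0] = 42          # first record holds an int
recs['f'][1] = 1.0         # second record holds a float

print(union_dt.itemsize)   # total bytes per record
print(hex(recs['i'][1]))   # the float's bit pattern seen through the int field
```

So the variants don't each take up their own space, but every record is as wide as the largest variant, and nothing stops you from reading the wrong field for a given tag.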