Hacker News new | ask | show | jobs
by thaumaturgy 3004 days ago
I had been under the impression that you could just feed text into neural nets, and then ... magic!

But, no. As it turns out, the very first problem you encounter when trying to implement ML on text is that you need to transform the text into some set of numbers (the "vectors"), with the elements in the set matching the number of nodes in your input layer.

This is a tricky thing to do. You're essentially trying to "hash" the text in a way which uniquely represents the text you're working with and also gives the neural net something it can operate on. Which is to say, you can't just use a common hashing algorithm, because the neural net won't be able to learn anything from the random output of the hashing algorithm.

There are several different approaches being used for this. One of them, mentioned elsethread, is "bag-of-words", where you build a big dictionary of word-to-number associations and then do some variety of transformations on that. Another is "feature extraction", where you might try to input a value representing properties like the length of the sentence, the number of words, the vocabulary level of the words, and so on. (This would probably be a bad approach for most ML goals on long text.)

This paper presents another approach.