Hacker News new | ask | show | jobs
by manthideaal 2270 days ago
it seems that neural networks have problem for computing the maximum function, and a human can compute the maximum easily, so it seems that the three heuristic rules don't work in this case.

(1) https://datascience.stackexchange.com/questions/56676/can-ma...

2 comments

You're referring to about whether a generic one-size-fits-all model will do well, but ML is full of bespoke models. It would be simple to build a neural network that can compute (and differentiate through) the max function to within some arbitrary epsilon, even though the most generic model (feed forward network) won't do great.
> l (feed forward network) won't do great.

See my answer below, in the case of this problem a generic feed-forward network, even a simple one, will work.

Not any ffn, but assuming you are using an efficient architecture search it will probably find one that works.

There's other numerical problems where this doesn't hold but that's another story.

You demonstrated it for a reeeeeeeallly constrained version of the problem. Do you expect your solution would generalize to many lists? Because it would be easy to make a neural network that does, while your toy example (and larger generalizations) probably won't generalize super well.

x_i = ith list element from list x

y = sum(x_i * softmax(k * x)_i)

This one parameter, arbitrarily wide network one will get arbitrarily close to the max function.

This is a super toy version of why attention is so effective. It can pick stuff.

That is untrue,

Here's a code example (actually took me ~20 minutes to get it "right" so I'll admit it's not the most trivial problem)... it includes seeds so that you can replicate locally (it should hit 100% accuracy all the time on the 1200 examples testing set reliably by about epoch 700):

'''

import torch import random from sklearn.metrics import accuracy_score

random.seed(61) torch.manual_seed(61)

X = [[random.random() for x in range(2)] for x in range(2000)] X_train = torch.FloatTensor(X[0:800]).cuda() X_test = torch.FloatTensor(X[800:]).cuda() X = torch.FloatTensor(X)

Y = [] for x in X: y = [0] * len(x) y[torch.argmax(x)] = 1 Y.append(y)

Y_train = torch.FloatTensor(Y[0:800]).cuda() Y_test = torch.FloatTensor(Y[800:]).cuda()

shape = [2,2] layers = [] for ind in range(len(shape) - 1): layers.append(torch.nn.Linear(shape[ind],shape[ind+1],bias=False))

net = torch.nn.Sequential(layers).cuda()

optim = torch.optim.SGD(net.parameters(), lr=1) criterion = torch.torch.nn.CrossEntropyLoss()

dataset = torch.utils.data.TensorDataset(X_train, Y_train) dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=10)

for i in range(pow(10,6)): for X,Y in dataloader: Yp = net(X) loss = criterion(Yp, Y.max(1).indices) loss.backward() optim.step() optim.zero_grad()

    if i > 500:
        optim = torch.optim.SGD(net.parameters(), lr=0.002)

    if i % 20 == 0:
        Yp = net(X_test)
        print('Training loss: ', loss.item())
        print(f'\nAccuracy score for epoch {i}:')
        print(accuracy_score(Yp.max(1).indices.tolist(),Y_test.max(1).indices.tolist()))
'''

This is as basic as you can get, predict the max out of 2 numbers, only uses a total of 4 node:

2 inputs (the 2 number) -> 2 outputs (the index of the maximum numbers). just 2 weight being optimize, no biases no nothing, as simple an implementation as you can get in terms of size.

There's also way to do it (apparently) where instead of treating it as "find the max index" you treat it as "output the maximum number": https://www.quora.com/Can-deep-neural-networks-learn-the-min...

But the approach I have will generalize to e.g. "Find the max of 5 or 100 or 1000 numbers" (although I assume it might take some time)

And overall you have no guarantee, that's why I qualified the statement and didn't say "Literally any imaginable problem that a human can solve without context".

To some extent it also matter how you encoder your number, you can train a 10000000000 parameter FCNN with RELU activations until the end of time to learn a simple mutliplication, and it won't be able to do so if you don't log encode your numbers or use some encoding or activation that means `` can be transposed in the `+` operations being done inside the nodes to combine the outputs... because that's outside of the scope of mathematics that given netwrok can do.

But, unless you are specifically trying to come up with an edge case and are instead looking at real world problems and trying to design the network in such a way as to best handle them (and this doesn't have to be all manual, you can use various NAS techniques), the rule will hold most of the time I believe.

You are right. Also the Universal Approximation theorem (1) for neural networks guarantees that neural networks can approximate continuous function on compact subsets of R^n, in this case max(x,y).

argmax([x,y]) = (sign(x[0]-x[1])+1)/2

Going beyond continuous functions, can deep learning be used for primality test?

(1) https://en.wikipedia.org/wiki/Universal_approximation_theore...

https://escholarship.org/content/qt5sg7n4ww/qt5sg7n4ww.pdf

> A long-standing difficulty for connectionism has been to implement compositionality, the idea of building a knowledge representation out of components such that the meaning arises from the meanings of the individual components and how they are combined. Here we show how a neural-learning algorithm, knowledge-based cascade-correlation (KBCC), creates a compositional representation of the prime-number concept and uses this representation to decide whether its input n is a prime number or not. KBCC conformed to a basic prime-number testing algorithm by recruiting source networks representing division by prime numbers in order from smallest to largest prime divisor up to √n. KBCC learned how to test prime numbers faster and generalized better to untrained numbers than did similar knowledge-free neural learners. The results demonstrate that neural networks can learn to perform in a compositional manner and underscore the importance of basing learning on existing knowledge.

But again, I think things such as prime number tests are the exact kind of edge cases where one needs too many heuristics built into the model for it to be practical to use.

But I think something like a prime test is not included under the definition I gave anyway, because the idea of "prime" actually implies a lot of context.

You can take a baby and he will be able to classify images, you can take a human that speaks a language with no concept of numbers and he will be able to play or sing music and distinguish patterns in it.

You can't talk about "prime" without a mathematical apparatus that takes years for humans to understand. However, since we learn it as such an early age, it ends up in the background.

Granted, that could be said about almost any cognitive ability (the fact that there's a lot of "subconscious context" required to use it).... so I don't know.