| HN Mirror


by Someone 2807 days ago

”When you increase the number of parameters (weights) in an NN by a factor of 5 you don’t just get 5 times the capacity and need 5 times as much training data. In terms of expressive capacity increase it’s more akin to a number with 5 times as many digits. So if V8’s expressive capacity was 10, V9’s capacity is more like 100,000.”

I find it very, very hard to believe that. I know ‘expressive power’ is a fairly vague concept, but if things scale that well, there must be papers out there that at least hint at such (IMHO) insane scaling laws.
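To make the arithmetic in the quote concrete, here is a minimal sketch of the two readings being contrasted. The function names are hypothetical, and reading "5 times as many digits" as multiplying the base-10 logarithm by 5 is an assumption about what the quoted article means.

```python
def linear_reading(capacity: int, factor: int) -> int:
    """Capacity grows in proportion to the parameter count."""
    return capacity * factor


def digits_reading(capacity: int, factor: int) -> int:
    """Assumed reading of the quote: the logarithm of the capacity is
    multiplied by `factor`, i.e. the capacity is raised to that power."""
    return capacity ** factor


v8_capacity = 10  # the illustrative value used in the quote
print(linear_reading(v8_capacity, 5))  # 50
print(digits_reading(v8_capacity, 5))  # 100000
```

Under this reading the claim is that capacity grows exponentially in the number of parameters, which is the scaling being questioned here.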

I think it would also have to mean that it is fairly easy for those with huge budgets to build a system that’s way better, except that it is too slow or takes too much power (just as 3D graphics in movies show what will be on our desks/phones in a decade or two).

I’ve asked this before, but does anybody know of papers that describe an offline self-driving system that’s as good as perfect?

3 comments

loser777 2807 days ago

It seems that there is stronger evidence that model architecture is more important than sheer number of parameters/computational complexity. Compare VGG-16 with something like MobileNet or ResNet-18.
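As one illustration of how architecture can decouple parameter count from layer shape, here is a rough sketch comparing a standard convolution with the depthwise separable convolution that MobileNet is built on. The layer sizes are made-up examples, biases are ignored, and the function names are hypothetical.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise kernel x kernel convolution followed by
    a 1x1 pointwise convolution (no bias)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 256 input and 256 output channels.
standard = standard_conv_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)
print(standard, separable, round(standard / separable, 2))
```

Both layers map the same input shape to the same output shape, yet the separable version uses several times fewer weights, so parameter count alone says little about what a network can do.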


X6S1x6Okd1st 2807 days ago

Here is a highly specific definition of "expressive power": https://en.m.wikipedia.org/wiki/VC_dimension

To me the statement doesn't seem that unreasonable.


mannykannot 2807 days ago

Thanks for bringing this to my attention. In that article, however, it says for neural nets:

V is the set of nodes. Each node is a simple computation cell. E is the set of edges. Each edge has a weight. .... If the activation function is the sigmoid function and the weights are general, then the VC dimension is ... at most O(|E|^2.|V|^2) [apologies for the crappy formatting]

Someone's quote from the article, by contrast, seems to be suggesting something exponential in the number of edges.


Someone 2806 days ago

I haven’t even tried to hunt down the book referenced on Wikipedia, but I think it’s worse than that. The Wikipedia page says ”The VC dimension of a neural network is bounded as follows”. In mathematics, “is bounded” is an expression that is more about what we know about a problem than about the problem itself. A classical example is Graham’s number, which ‘bounds’ a number whose value we know to be at least 13 (see https://en.wikipedia.org/wiki/Graham's_number#Context).

Given the huge range between that upper bound and the lower bound of Ω(|E|²), chances are that upper bound is far from tight (https://en.wikipedia.org/wiki/Upper_and_lower_bounds#Tight_b...).

Also, one line below the O(|E|².|V|²) you quoted:

”If the weights come from a finite family (e.g. the weights are real numbers that can be represented by at most 32 bits in a computer), then, for both activation functions, the VC dimension is at most O(|E|)”

Of course, they may use a different activation function, in which case that mathematical statement doesn’t apply, but I would think that is even less likely than the claim made in the article we’re discussing.

For example, it would hugely surprise me if an activation function that isn’t increasing, or that has many large discontinuities, behaved a lot better than the sigmoid they are surely using.
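As a rough sanity check on the bounds quoted in this thread, the sketch below computes how much each one grows when the network is scaled up by a factor of 5. It assumes the number of nodes |V| grows by the same factor as the number of edges |E| and ignores the constants hidden by the O and Ω notation, so it shows growth rates only, not actual VC dimensions. The function name is hypothetical.

```python
def growth_factor(edge_power: int, node_power: int, k: int) -> int:
    """Factor by which |E|**edge_power * |V|**node_power grows when
    both |E| and |V| are multiplied by k."""
    return k ** (edge_power + node_power)


K = 5  # the scale-up factor from the quoted article

bounds = {
    "upper bound O(|E|^2 |V|^2), sigmoid, general weights": growth_factor(2, 2, K),
    "lower bound Omega(|E|^2)": growth_factor(2, 0, K),
    "upper bound O(|E|), finite-precision weights": growth_factor(1, 0, K),
}

for name, factor in bounds.items():
    print(f"{name}: grows by a factor of {factor}")
```

Even the loosest of these bounds grows polynomially (a factor of 625 here), while the finite-precision bound grows only linearly (a factor of 5). None of them gives the exponential jump from 10 to 100,000 described in the quote.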


X6S1x6Okd1st 2807 days ago

Good point

shoyer 2807 days ago

Strongly agreed. I'd love to see a paper backing up this claim, but I'm pretty sure it's just wrong.

Likewise, I would be quite surprised if Tesla is really pushing state of the art for the size of their vision models with what they've deployed in cars. Researchers have built some pretty big models...
