Hacker News new | ask | show | jobs
by mike_hearn 1036 days ago
Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?
1 comments

Yes of course, sorry my write-up was confusing: I meant that "adding a ReLU between the two linear layers" (the second option) would result in more parameters than "directly removing the second linear layer" (the first option). And my message just meant "I don't know which of the two options achieves the best trade-off between speed and quality". I didn't consider the option "leave it as it is in the blog post" because it is essentially equivalent to the first option (removing the linear layer) but slower (as you say, with exactly the same number of parameters as the second option), so it definitely shouldn't be a "best" option.
Thank you!