| HN Mirror



	by hansvm 1219 days ago
	The strategy for most of these things is that the bulk of the network builds up a bunch of "shapes", and a final layer projects those appropriately into the output space. The intermediate layers can use basically any activation that has desirable convergence properties, and at the end you might have a linear projection (or a full MLP layer) followed by a sigmoid or other reshaping. The GPT family uses "softmax": an exponentially weighted normalizing function that exponentiates each output and divides by the total, so the results are all positive and sum to 1 (since they represent probabilities for each possible next token).
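	For anyone who wants to see the softmax step concretely, here is a minimal sketch in plain Python. The `logits` values and the 4-token vocabulary are made up for illustration, and subtracting the max is a common numerical-stability trick I'm assuming here, not something specific to GPT.

```python
import math

def softmax(logits):
    """Map raw scores to values that are positive and sum to 1."""
    # Subtracting the max avoids overflow in exp() and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from a final linear projection,
# one per token in a made-up 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
```

	The largest score ends up with the largest probability, and the exponential weighting means gaps between scores get amplified rather than scaled linearly.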