|
|
|
|
|
by mk67
765 days ago
|
|
I'm in the industry and nobody does that since over ten years. There was just a small phase when Hinton published "Greedy layer-wise training of deep networks" in 2007 and people did it for a few years at most. But already with the rise of LSTMs in the 2010s this wasn't done anymore and now with transformers also not. Would you care to share how you reached your conclusion as it matches 0 of my experience over the last 15 years and we also train large-scale LLMs in our company. There's just not much point to it when gradients don't vanish. |
|