by bitL 2420 days ago
RNNs (LSTM/GRU) tend to scale poorly. Attention-based models like the Transformer, on the other hand, scale extremely well.
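
One common reading of the scaling point is that an RNN has to walk the sequence step by step, while self-attention computes every position independently of the others' outputs, so it parallelizes across the sequence. Here is a toy sketch of that difference in plain Python. It uses scalars instead of vectors and has no learned projections; the names `rnn_forward`, `self_attention`, `w_x`, and `w_h` are made up for illustration and do not come from any library.

```python
import math


def rnn_forward(xs, w_x=0.5, w_h=0.9):
    """Toy scalar RNN. Each hidden state depends on the previous one,
    so the time steps have to be computed in order."""
    h = 0.0
    states = []
    for x in xs:  # sequential: step t needs h from step t-1
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def self_attention(xs):
    """Toy scalar self-attention with query = key = value = x.
    Each output depends only on the inputs, never on another output,
    so the iterations of the outer loop could run in parallel."""
    out = []
    for q in xs:  # iterations are independent of each other
        scores = [q * k for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return out


if __name__ == "__main__":
    seq = [0.1, -0.4, 0.7, 0.2]
    print(rnn_forward(seq))
    print(self_attention(seq))
```

In the RNN loop the variable `h` is carried from one iteration to the next, which is what blocks parallelism over time steps. The attention loop carries nothing between iterations, so a real implementation can replace it with a few matrix multiplications.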