by gdiamos 618 days ago
RNNs always had better scaling law curves than transformers.

BPTT (backpropagation through time) was their problem.