Hacker News
by darkbatman 237 days ago
Judging from the paper, the memory needed per layer seems to be higher than in a standard transformer architecture. Pretty sure that would blow up GPU VRAM at scale.