by acchow 1166 days ago
Can someone explain why computing a delta needs to hold the entire model in memory at once? Can't it just do one layer at a time?