by 303bookworm 540 days ago
Really excited to see this! Two questions: 1. Did you try RTD (ELECTRA-like pretraining), or did you skip it for compatibility reasons? 2. Why not incorporate Jamba-like alternating Mamba2 layers?