by taykolasinski
164 days ago
OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).
2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
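For anyone who wants to see the first takeaway in isolation, here is a toy sketch. It is not the reproduction code. It assumes the mHC constraint amounts to making each stream-mixing matrix doubly stochastic via Sinkhorn-style normalization, which is an assumption here and not something stated in this comment. The stream count, depth, noise scale, and every function name are hypothetical, and the printed numbers are not expected to match the 7x figure.

```python
import math
import random

N_STREAMS = 4   # hypothetical number of residual streams
DEPTH = 64      # hypothetical number of stacked layers
NOISE = 0.1     # hypothetical scale of the mixing perturbation


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def random_logits(rng, n, scale):
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


def unconstrained_mix(logits):
    """Identity plus a free perturbation: nothing bounds the row sums."""
    n = len(logits)
    return [[(1.0 if i == j else 0.0) + logits[i][j] for j in range(n)]
            for i in range(n)]


def sinkhorn_mix(logits, iters=30):
    """Exponentiate, then alternately normalize rows and columns (assumed constraint)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]                # rows sum to 1
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return m


def inf_norm(m):
    """Largest absolute row sum: worst-case gain measured in the max-abs norm."""
    return max(sum(abs(v) for v in row) for row in m)


def stacked_gain(make_mix, seed=0):
    """Multiply DEPTH mixing matrices together and measure the end-to-end gain."""
    rng = random.Random(seed)
    total = identity(N_STREAMS)
    for _ in range(DEPTH):
        total = matmul(make_mix(random_logits(rng, N_STREAMS, NOISE)), total)
    return inf_norm(total)


free_gain = stacked_gain(unconstrained_mix)
constrained_gain = stacked_gain(sinkhorn_mix)
print(f"unconstrained: {free_gain:.3f}  sinkhorn: {constrained_gain:.3f}")
```

The point of the sketch: a product of doubly stochastic matrices is still doubly stochastic, so the constrained gain stays pinned at 1 no matter how deep the stack is, while nothing bounds the unconstrained product, which is free to drift away from 1 as depth grows.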