Hacker News new | ask | show | jobs
by nravic 357 days ago
This is super interesting! We do something similar I think by taking a checkpoint after model initialization. I'm curious what you think about our approach, here's some benchmarks: https://docs.cedana.ai/articles/performance-of-cedanas-gpu-i...

We do some on-the-fly optimizations as well (like compiling into CUDA graphs or fusing together calls) which ends up resulting (for some inference engines) faster token throughput too.