|
|
|
|
|
by swyx
687 days ago
|
|
i covered SAM 1 a year ago (https://news.ycombinator.com/item?id=35558522). notes from quick read of the SAM 2 paper https://ai.meta.com/research/publications/sam-2-segment-anyt... 1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM1 was 68 hrs on same cluster). Taking the upper end $2 A100 cost off gpulist means SAM2 cost ~$50k to train - surprisingly cheap for adding video understanding? 2. new dataset: the new SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity incl that of annotators. I wonder if
LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard.. 3. bootstrapped annotation: similar to SAM1, a 3 phase approach where 16k initial annotations across 1.4k videos was then expanded to 63k+197k more with SAM 1+2 assistance, with annotation time accelerating dramatically (89% faster than SAM1 only) by the end 4. memory attention: SAM2 is a transformer with memory across frames! special "object pointer" tokens stored in a "memory bank" FIFO queue of recent and prompted frames. Has this been explored in language models? whoa? (written up in https://x.com/swyx/status/1818074658299855262) |
|