by endymi0n 1641 days ago
Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 bytes per token would be roundabout 2 Terabyte. That, parked on a fast NVMe SSD, will give you roundabout 1MM random lookups per second. Even with some transfers in between, this should be more than enough to perform not just in equal time but probably in less, and it would cost you less than the GPU you need for the reduced-size model.

Exciting times.
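
For anyone who wants to check the napkin math, here is a minimal sketch. The token count and lookup rate are the figures from the comment above; the 4 bytes per token is a rough assumption, and `corpus_tb` and `per_lookup_us` are just illustrative names.

```python
# Back-of-the-envelope check of the corpus size estimate.
TOKENS = 500e9               # ~500 billion tokens (figure from the comment)
BYTES_PER_TOKEN = 4          # assumption: roughly 4 bytes of raw text per token

corpus_bytes = TOKENS * BYTES_PER_TOKEN
corpus_tb = corpus_bytes / 1e12          # decimal terabytes

LOOKUPS_PER_SEC = 1_000_000  # ~1MM random reads per second (figure from the comment)
# Amortized time per lookup at full throughput, in microseconds.
# This is throughput-derived, not the latency of a single read.
per_lookup_us = 1e6 / LOOKUPS_PER_SEC

print(f"corpus size: {corpus_tb:.1f} TB")
print(f"amortized time per lookup: {per_lookup_us:.1f} us")
```

Under these assumptions the corpus comes out at about 2 TB, which fits on a single consumer NVMe drive.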

2 comments

The real problem with (NVMe) SSDs is that they have a limited number of write cycles (a maximum number of terabytes written).

If you don't update your database and indices, they are great. But updating them is really tempting when you do machine learning (especially if you know that people with deeper pockets will do so).

Typically you will have a neural network, you run it on your dataset, it produces a new dataset of embeddings, you index them, and you use this index to train a new neural network, and you repeat the loop, hopefully improving results along the way.

An NVMe SSD can write at 6 GB/s but can only endure ~800 TB of writes; that's about 37 hours of lifetime at max speed.
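
The 37-hour figure can be reproduced with a short calculation. This is only a sketch using the two numbers quoted above (6 GB/s and ~800 TB) in decimal units; real drives vary in rated endurance and rarely sustain peak write speed.

```python
# Time to exhaust an SSD's rated write endurance at full write speed.
WRITE_SPEED_GB_S = 6     # sustained write speed in GB/s (figure from the comment)
ENDURANCE_TB = 800       # rated total terabytes written (figure from the comment)

endurance_gb = ENDURANCE_TB * 1000           # decimal units: 1 TB = 1000 GB
seconds = endurance_gb / WRITE_SPEED_GB_S    # seconds of continuous writing
hours = seconds / 3600

print(f"lifetime at max write speed: {hours:.1f} hours")
```

At a lower, more realistic write duty cycle the lifetime stretches proportionally, for example about 15 days if the drive writes at full speed 10% of the time.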

> Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 bytes per token would be roundabout 2 Terabyte.

"Only" 825 GB actually: https://pile.eleuther.ai/

A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though it seems that the team behind GPT-J are clearly happy to distribute their full set of data anyway, and seem to be enough under the radar to not attract the wrong sort of attention, at least for now).

Not pointing out such potential problems in public forums is likely to improve the chances that it remains readily available.
Touché. (Though with regard to those particular problematic bits, they already tweeted about it themselves, and that tweet had more likes than this submission currently has points.)