Hacker News
by iggldiggl 1638 days ago
> Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 tokens per byte would be roundabout 2 Terabyte.

"Only" 825 GB actually: https://pile.eleuther.ai/

A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though it seems that the team behind GPT-J is clearly happy to distribute its full set of data anyway, and seems to be far enough under the radar not to attract the wrong sort of attention, at least for now).
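
For what it's worth, the quoted napkin math only comes out at about 2 TB if the ratio is read as roughly 4 bytes per token rather than 4 tokens per byte. A quick sketch under that assumption (the 4 bytes/token figure is just a rule of thumb, "825 GB" is treated as decimal gigabytes, and the function names are made up for illustration):

```python
# Back-of-the-envelope corpus size check.
# ASSUMPTION: ~4 bytes of raw text per token (rule of thumb, not a measured value).
BYTES_PER_TOKEN = 4


def corpus_bytes(tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimated raw text size in bytes for a given token count."""
    return tokens * bytes_per_token


def tokens_for(size_bytes: int, bytes_per_token: int = BYTES_PER_TOKEN) -> float:
    """Estimated token count for a given raw text size in bytes."""
    return size_bytes / bytes_per_token


# Figure from the quoted comment: ~500 billion tokens.
estimate_tb = corpus_bytes(500 * 10**9) / 10**12

# Figure from this comment: 825 GB (ASSUMPTION: decimal GB, i.e. 10**9 bytes).
implied_tokens_billion = tokens_for(825 * 10**9) / 10**9

print(f"500B tokens at 4 bytes/token: {estimate_tb:.1f} TB")
print(f"825 GB at 4 bytes/token: {implied_tokens_billion:.2f}B tokens")
```

Under the same ratio, 825 GB would work out to roughly 206 billion tokens, so the 500 billion figure and the 825 GB figure don't line up unless the real bytes-per-token ratio is quite different.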

1 comment

Not pointing out such potential problems in public forums is likely to improve the chances that it remains readily available.
Touché. (Though with regard to those particular problematic bits, they already tweeted about it themselves, and that tweet had more likes than this submission currently has points.)