by iggldiggl
1638 days ago
> Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 tokens per byte would be roundabout 2 Terabyte.

"Only" 825 GB actually: https://pile.eleuther.ai/

A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though the team behind GPT-J seem happy to distribute their full set of data anyway, and seem to be far enough under the radar not to attract the wrong sort of attention, at least for now).
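For anyone who wants to redo the napkin math, here is a rough sketch in Python. It uses only the figures quoted in this thread and reads the quoted rate as about 4 bytes per token, which is an assumption (it is the reading under which 500 billion tokens comes out near 2 TB). The variable names are made up for illustration, and 825 GB is treated as decimal gigabytes.

```python
# Figures from the thread, treated as rough values in decimal units.
TOKENS = 500e9         # "around 500 billion tokens" (quoted estimate)
BYTES_PER_TOKEN = 4    # assumed reading of the quoted rate
PILE_BYTES = 825e9     # "825 GB", assumed to mean decimal gigabytes

# Napkin estimate from the quoted comment: tokens times bytes per token.
estimate_bytes = TOKENS * BYTES_PER_TOKEN
estimate_tb = estimate_bytes / 1e12

# Bytes per token implied if those tokens were exactly the 825 GB corpus.
implied_bytes_per_token = PILE_BYTES / TOKENS

# How far the napkin estimate overshoots the stated corpus size.
overestimate_factor = estimate_bytes / PILE_BYTES

print(f"napkin estimate:         {estimate_tb:.1f} TB")
print(f"implied bytes per token: {implied_bytes_per_token:.2f}")
print(f"overestimate factor:     {overestimate_factor:.2f}x")
```

The implied bytes-per-token figure only means something if the 500 billion tokens correspond to a single pass over the 825 GB, which nothing in the thread establishes, so treat it as a sanity check rather than a measurement.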