Hacker News
by karpathy 754 days ago
That is 100% my intention and hope, and I think we are very close to deleting all of that. Right now on master, I am already only using Python for the tokenization preprocessing. In principle the requirements for llm.c should be extremely minimal. I think this is a few days of work, and it is high on my mind.

Biggest problem right now is finding a place that can host the 135GB of tokens for FineWeb100B. Will probably use S3 or something.

Related, see: https://github.com/karpathy/llm.c/issues/482

1 comment

Could this be a good case for a torrent?