I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have?
Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. Before that, the model weights didn't get cloned by default for me on macOS.
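If you want to check whether the weights actually came down, here is a rough sketch. It relies on the fact that an un-pulled Git LFS file is a small text pointer starting with a fixed `version https://git-lfs.github.com/spec/v1` line. The helper names `looks_like_lfs_pointer` and `find_unpulled` are made up for illustration; they are not part of nanochat or the gist.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the leading bytes of a file match an LFS pointer."""
    return head.startswith(LFS_POINTER_PREFIX)


def find_unpulled(model_dir: str) -> list[str]:
    """List files under model_dir that are still LFS pointers (hypothetical helper)."""
    root = Path(model_dir)
    unpulled = []
    for p in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not p.is_file() or ".git" in p.relative_to(root).parts:
            continue
        with open(p, "rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if looks_like_lfs_pointer(head):
            unpulled.append(str(p))
    return unpulled
```

Calling `find_unpulled("/tmp/nanochat")` should return an empty list once `git lfs pull` has replaced the pointers with the real files.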
% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
--model-dir nanochat/ --prompt "who is simonw on hacker news?"
Using device: cpu
Loading model from nanochat/model_000650.pt
Loading metadata from nanochat/meta_000650.json
Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Loading model weights (this may take a minute for a 2GB model)...
Converting model to float32 for CPU...
Model loaded successfully!
Loading tokenizer...
Tokenizer loaded successfully!
Prompt: who is simonw on hacker news?
Encoded to 9 tokens
Generating...
--------------------------------------------------
who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.
In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
--------------------------------------------------
Adding on: Claude also gave me the following line which was necessary to get the model weights to download from HF. This might be obvious for anyone familiar with HF but it helped me so sharing here!
For anyone curious, this is the error when running uv sync on macOS:
> uv sync
Resolved 88 packages in 3ms
error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels
Also, tmp/nanochat expects all the contents of the tokenizer and chatsft_checkpoints folders.
Yeah, that's because CUDA on a Mac isn't a thing. It could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting or patching if there's no MPS version of the CUDA kernels.
Isn't there a common PyTorch API that could choose an OS- or hardware-specific backend automatically? Or is this project hard-coding the CUDA variant of PyTorch as a requirement?
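For the runtime side of that question, a common pattern looks roughly like the sketch below. It uses `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, which are real PyTorch calls; the function names `pick_device_name` and `detect_device_name` are hypothetical, and this is not taken from the nanochat code.

```python
def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device_name() -> str:
    """Ask PyTorch which backends are usable; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the mps backend module at all.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device_name(torch.cuda.is_available(), mps_ok)


print(detect_device_name())
```

This only covers choosing a device at runtime. The `uv sync` failure quoted above is a packaging matter (the resolved `torch==2.8.0+cu128` build has no macOS wheel), so device selection alone would not fix it.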
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
It's so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and look at how well my tokenizer actually compared versus how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss, i.e. how surprised it was by the real answer.
Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.
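To make that concrete, here is a small sketch of the general bits-per-byte idea: add up the per-token loss (in nats), convert to bits, and divide by the total number of bytes the tokens cover. The numbers are invented for illustration, and this is not necessarily how nanochat computes it (for instance, it may treat special tokens differently).

```python
import math


def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Total loss converted to bits, divided by the total bytes of the target tokens."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)


# Hypothetical numbers: the same 12-byte text split two different ways.
# Tokenizer A: 3 tokens of 4 bytes each, 2.0 nats of loss per token.
losses_a, bytes_a = [2.0, 2.0, 2.0], [4, 4, 4]
# Tokenizer B: 6 tokens of 2 bytes each, 1.0 nat of loss per token.
losses_b, bytes_b = [1.0] * 6, [2] * 6

mean_token_loss_a = sum(losses_a) / len(losses_a)
mean_token_loss_b = sum(losses_b) / len(losses_b)
bpb_a = bits_per_byte(losses_a, bytes_a)
bpb_b = bits_per_byte(losses_b, bytes_b)

print("mean loss per token:", mean_token_loss_a, mean_token_loss_b)
print("bits per byte:", bpb_a, bpb_b)
```

The per-token loss makes B look twice as good as A, but both spend the same total surprise on the same 12 bytes, so their bits per byte are identical.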
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
absolutely requires longer training time and more compute.
once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies a token will occur later in the sentence then that awareness needs to be maintained while processing each intermediary token and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge the better the prediction.
if you had infinite compute and data for training then performance would be equivalent though, i think.
Since OpenAI's tokenizer is estimated at ~4.2 characters per token, your proposed one-character-per-token tokenizer would immediately make the effective context length 4.2 times smaller and generation 4.2 times slower (since 4.2 times more tokens are needed for the same output). That doesn't look like a good tradeoff.
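The arithmetic behind that, as a back-of-the-envelope sketch. The 4.2 figure is the estimate quoted in this comment, and 2048 is the `sequence_len` from the model config logged earlier in the thread; the 1000-character reply is an arbitrary example.

```python
import math

CHARS_PER_TOKEN = 4.2  # estimate quoted in the comment above
SEQUENCE_LEN = 2048    # sequence_len from the model config log earlier in the thread

# How much text fits in one context window under each scheme.
subword_context_chars = SEQUENCE_LEN * CHARS_PER_TOKEN
char_level_context_chars = SEQUENCE_LEN * 1

# How many generation steps a 1000-character reply needs under each scheme.
reply_chars = 1000  # arbitrary example length
subword_steps = math.ceil(reply_chars / CHARS_PER_TOKEN)
char_level_steps = reply_chars

print("context (chars):", subword_context_chars, "vs", char_level_context_chars)
print("steps for reply:", subword_steps, "vs", char_level_steps)
```

With the same window size, the character-level model sees about 4.2 times less text and needs about 4.2 times as many steps to produce the same reply.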
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR