| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sammyd56 247 days ago
	I'm doing a training run right now (started 20min ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij Will share the resulting model once ready (4 hours from now) for anyone to test inference.

4 comments

sammyd56 247 days ago

I've uploaded the model here: https://huggingface.co/sdobson/nanochat

I didn't get as good results as Karpathy (unlucky seed?)

It's fun to play with though...

User: How many legs does a dog have? Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)

link

simonw 247 days ago

I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...

You can run it like this:

  cd /tmp
  git clone https://huggingface.co/sdobson/nanochat
  uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
    --model-dir /tmp/nanochat \
    --prompt "Tell me about dogs."

link

sammyd56 247 days ago

This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...

link

vessenes 247 days ago

Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. before then, the model weights didn't get cloned by default for me on macOS.

% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \ --model-dir nanochat/ --prompt "who is simonw on hacker news?" Using device: cpu Loading model from nanochat/model_000650.pt Loading metadata from nanochat/meta_000650.json Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280} Loading model weights (this may take a minute for a 2GB model)... Converting model to float32 for CPU... Model loaded successfully! Loading tokenizer... Tokenizer loaded successfully!

Prompt: who is simonw on hacker news? Encoded to 9 tokens

Generating... -------------------------------------------------- who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.

In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker --------------------------------------------------

link

homeless_engi 247 days ago

Adding on: Claude also gave me the following line which was necessary to get the model weights to download from HF. This might be obvious for anyone familiar with HF but it helped me so sharing here!

git lfs install

link

iamcreasy 247 days ago

For anyone curious this is the error when running uv sync on macos,

> uv sync Resolved 88 packages in 3ms error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform

hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels

Also, tmp/nanochat expects all contents from tokenizer and chatsft_checkpoints folder.

link

stoobs 247 days ago

Yeah, that's because cuda on a mac isn't a thing - it could be swapped to the normal torch package but you'd have to do some code patching to make sure it's running on mps, even then some of the code may need rewriting/patching if there's no mps version of the cuda kernals.

link

iamcreasy 246 days ago

Isn't there a common PyTorch API interface that could chose OS/hardware specific backend automatically? Or this project is hard coding cuda variant of PyTorch as a requirement?

link

Lerc 247 days ago

The comment beside the first chart

>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".

Is so blindingly obvious, that I'm ashamed to think that I didn't think do it when trialing my own tokenizer approach on tinystories. I might go back and have a look at how well my tokenizer compared to how well I imagined it compared.

link

SeanAnderson 247 days ago

ELI5 for anyone else (I had to have this explained to me):

When you train a language model, it tries to predict the next token.

We measure how good it is at that using loss aka how surprised it was by the real answer.

Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.

So, compare loss to bytes of text data instead.

link

typpilol 247 days ago

Why hasn't anyone made a tokenizer that's 1 character per token. Is it because it requires an insane amount of compute?

Or would the loss of efficiency make it dumber then modern tokenizers?

link

nl 247 days ago

Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.

Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.

[1] https://aclanthology.org/P16-1162/

link

SeanAnderson 247 days ago

yes to both.

absolutely requires longer training time and more compute.

once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies a token will occur later in the sentence then that awareness needs to be maintained while processing each intermediary token and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge the better the prediction.

if you had infinite compute and data for training then performance would be equivalent though, i think.

link

skirmish 247 days ago

Since OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token tokenizer", the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.

link

royosherove 247 days ago

Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?

link

sammyd56 247 days ago

There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR

link

royosherove 247 days ago

Ah I was missing the WANDB_RUN env var. so did not get any logs. thanks!

link

bravura 247 days ago

The measures that drop exponentially like val/bpb and train/loss you should put the x-axis in log-scale. That will better show you if it's converged

link

sammyd56 247 days ago

Great call, thankyou - I switched to log scale for those metrics - agree that it is much clearer.

link

bravura 246 days ago

Sorry fat fingers. It should be the y axis that is log scale, not x axis. (Sometimes both is good.)

Did you notice the inflection point in which the loss drops faster than expected in the top graph? Maybe you should let it run more…

link