Tiny hackable CUDA language model implementation


	Tiny hackable CUDA language model implementation (github.com)
	81 points by markusheimerl 8 days ago

5 comments

yobbo 6 days ago

Looks very nice, but I can't find numerical gradient checks, which are helpful for verifying that the backward pass is correct:

https://github.com/markusheimerl/gpt/blob/main/transformer/a...
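For readers who have not seen one, a numerical gradient check compares the analytic gradient against a central finite difference of the loss. The sketch below is a minimal illustration on a toy quadratic loss, not the repository's transformer; the names `loss`, `loss_grad`, and `max_grad_error` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper so the sketch does not depend on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical toy loss: L(w) = 0.5 * sum_i (w_i * x_i - y_i)^2 */
static double loss(const double *w, const double *x, const double *y, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = w[i] * x[i] - y[i];
        total += 0.5 * r * r;
    }
    return total;
}

/* Analytic gradient of the toy loss: dL/dw_i = (w_i * x_i - y_i) * x_i */
static void loss_grad(const double *w, const double *x, const double *y,
                      size_t n, double *grad) {
    for (size_t i = 0; i < n; i++) {
        grad[i] = (w[i] * x[i] - y[i]) * x[i];
    }
}

/* Returns the largest relative error between the analytic gradient and a
 * central-difference estimate (L(w + eps) - L(w - eps)) / (2 * eps). */
double max_grad_error(double *w, const double *x, const double *y,
                      size_t n, double eps) {
    double grad[16];
    assert(n <= 16); /* fixed scratch buffer keeps the sketch short */
    loss_grad(w, x, y, n, grad);

    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double saved = w[i];
        w[i] = saved + eps;
        double loss_plus = loss(w, x, y, n);
        w[i] = saved - eps;
        double loss_minus = loss(w, x, y, n);
        w[i] = saved; /* restore the parameter before moving on */

        double numeric = (loss_plus - loss_minus) / (2.0 * eps);
        double denom = abs_d(numeric) + abs_d(grad[i]);
        if (denom < 1e-12) denom = 1e-12; /* avoid dividing by zero */
        double err = abs_d(numeric - grad[i]) / denom;
        if (err > worst) worst = err;
    }
    return worst;
}
```

In a real model the same loop would perturb a sample of weights and call the full forward pass in place of `loss`; a relative error that is not tiny usually points at a bug in the backward pass.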


markusheimerl 6 days ago

I deleted the numerical checks a while back, after confirming that the backward pass is correct, to keep the code base lean. Running https://github.com/markusheimerl/gpt/blob/main/transformer/a... is also something of a confirmation that the backward pass is correct, since an analytically incorrect backward pass can't fit synthetic data perfectly.


Gred_papa_dance 5 days ago

I need more info:

* Where is the data ("make data")? How do I create my own data (questions for chat?)
* How do I create a tokenizer (maybe a separate one)?
* How do I stop the code, how much memory is needed, how do I set the context size, etc.?
* How do I create a LoRA or continue training on new data?
* How do I quantize the model?

In my opinion this is a great idea, but making a Ruby extension would be a good way to increase the number of people using this code.


markusheimerl 5 days ago

The data gets downloaded via curl from Hugging Face. Sure, you can make your own data: simply dump all the text you want the model to be trained on into "corpus.txt" and skip "make data".

As a tokenizer adds substantial complexity, this implementation does not include any tokenization logic and works on raw bytes. Feel free to add your own tokenizer with the help of the coding model of your choice.
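Working on raw bytes means each byte of the text is its own token, so the vocabulary has 256 entries and non-ASCII characters span several tokens. A minimal sketch of that idea follows; the function names and `BYTE_VOCAB_SIZE` are hypothetical and not taken from the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* With byte-level input, every possible byte value is a token id. */
enum { BYTE_VOCAB_SIZE = 256 };

/* Encode: each byte of the NUL-terminated text becomes one token id in
 * [0, 255]. Returns the number of tokens written. */
size_t bytes_to_tokens(const char *text, int *tokens, size_t capacity) {
    size_t len = strlen(text);
    assert(len <= capacity);
    for (size_t i = 0; i < len; i++) {
        /* cast through unsigned char so bytes >= 128 do not go negative */
        tokens[i] = (unsigned char)text[i];
    }
    return len;
}

/* Decode: turn token ids back into bytes; out needs room for count + 1. */
void tokens_to_bytes(const int *tokens, size_t count, char *out) {
    for (size_t i = 0; i < count; i++) {
        assert(tokens[i] >= 0 && tokens[i] < BYTE_VOCAB_SIZE);
        out[i] = (char)(unsigned char)tokens[i];
    }
    out[count] = '\0';
}
```

For example, the UTF-8 text "hé" is three bytes and therefore three tokens: 104, 195, 169.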

You can stop the training using CTRL+C. You can train on as little memory as you have: simply reduce the batch size and/or model dimensions in train.c. You can change the context window size in train.c via the "seq_len" variable.
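To get a feel for why these knobs matter, here is a rough back-of-the-envelope sketch. It assumes float32 tensors and an attention implementation that materializes a full [batch, heads, seq_len, seq_len] score tensor; both are assumptions, and `num_heads` and `d_model` are hypothetical parameter names rather than identifiers from train.c.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one float32 attention-score tensor of shape
 * [batch_size, num_heads, seq_len, seq_len]. Grows linearly with the
 * batch size and quadratically with the context length. */
size_t attention_scores_bytes(size_t batch_size, size_t num_heads,
                              size_t seq_len) {
    assert(batch_size > 0 && num_heads > 0 && seq_len > 0);
    return batch_size * num_heads * seq_len * seq_len * sizeof(float);
}

/* Bytes for one float32 activation tensor of shape
 * [batch_size, seq_len, d_model]. Linear in every dimension. */
size_t activation_bytes(size_t batch_size, size_t seq_len, size_t d_model) {
    assert(batch_size > 0 && seq_len > 0 && d_model > 0);
    return batch_size * seq_len * d_model * sizeof(float);
}
```

These numbers cover single tensors only, so real usage will be higher (weights, gradients, optimizer state, one set of activations per layer). The scaling is the useful part: halving the batch size halves both figures, while halving the context length cuts the score tensor to a quarter.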

Regarding Ruby, LoRA and quantization, I'll have to refer you to the coding agent of your choice.


ewew53 5 days ago

Maybe add a simple step between start and train:

convert the text data to binary data. This would help with converting different kinds of data.

(please support 8-bit, 16-bit, and 32-bit formats)
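A conversion step like the one suggested could look roughly like the sketch below, which packs token ids into a byte buffer using 1, 2, or 4 bytes per token. The function name `pack_tokens` and the little-endian layout are assumptions made for illustration; nothing like this exists in the repository as far as this thread shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs token ids into out using width_bytes (1, 2, or 4) bytes per token,
 * least significant byte first. out must hold count * width_bytes bytes.
 * Returns the number of bytes written. */
size_t pack_tokens(const uint32_t *tokens, size_t count, int width_bytes,
                   uint8_t *out) {
    assert(width_bytes == 1 || width_bytes == 2 || width_bytes == 4);
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        /* every id must fit into the chosen width */
        assert(width_bytes == 4 || tokens[i] < (1u << (8 * width_bytes)));
        for (int b = 0; b < width_bytes; b++) {
            out[pos++] = (uint8_t)(tokens[i] >> (8 * b));
        }
    }
    return pos;
}
```

The resulting buffer could then be written to disk with `fwrite`. For raw bytes the 8-bit width is enough; the wider formats only become useful once a tokenizer with more than 256 entries is added.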


qqqqqlqq 5 days ago

$make run -j 10

CUDA error in attention.c:91: out of memory

Command exited with non-zero status 1

1.38user 0.46system 0:00.75elapsed 246%CPU (0avgtext+0avgdata 226164maxresident)k

0inputs+0outputs (0major+25414minor)pagefaults 0swaps

make: *** [Makefile:34: run] Error 1

clang: warning: CUDA version 12.4 is only partially supported [-Wunknown-cuda-version]

(I have Ubuntu and an NVIDIA GeForce RTX 3050 with 8 GB of memory: 876MiB / 8192MiB.)


markusheimerl 5 days ago

Reduce batch size in train.c


ewew53 5 days ago

How do I do this? I have the same error.


markusheimerl 5 days ago

https://github.com/markusheimerl/gpt/blob/main/train.c - in this file, search for the line "const int batch_size = 15;" - reduce this number


oakinnagbe 5 days ago

Nice implementation. Have you thought about supporting LoRA fine-tuning on top of this, or is the design too low-level for that kind of extension?


markusheimerl 5 days ago

Sure, it could be extended to support LoRA fine-tuning, but the goal of this implementation is to be as lean and efficient a pre-training stack as possible.
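For context on what such an extension would involve: LoRA keeps the pre-trained weight matrix W frozen and trains a low-rank update, so a linear layer computes y = W x + (alpha / r) * B (A x). The sketch below shows only that forward pass on the CPU; the function name, the row-major layouts, and the fixed scratch buffer are assumptions for illustration and are not part of the repository.

```c
#include <assert.h>
#include <stddef.h>

/* LoRA forward pass for one input vector (all matrices row-major):
 *   W: [out_dim x in_dim]  frozen pre-trained weights
 *   A: [rank x in_dim]     trainable down-projection
 *   B: [out_dim x rank]    trainable up-projection
 *   y = W x + (alpha / rank) * B (A x) */
void lora_forward(const float *W, const float *A, const float *B,
                  const float *x, float *y,
                  size_t out_dim, size_t in_dim, size_t rank, float alpha) {
    float h[64]; /* h = A x, the low-rank intermediate */
    assert(rank > 0 && rank <= 64);

    for (size_t r = 0; r < rank; r++) {
        h[r] = 0.0f;
        for (size_t i = 0; i < in_dim; i++) {
            h[r] += A[r * in_dim + i] * x[i];
        }
    }

    float scale = alpha / (float)rank;
    for (size_t o = 0; o < out_dim; o++) {
        float base = 0.0f;  /* frozen path: row o of W times x */
        for (size_t i = 0; i < in_dim; i++) {
            base += W[o * in_dim + i] * x[i];
        }
        float delta = 0.0f; /* adapter path: row o of B times h */
        for (size_t r = 0; r < rank; r++) {
            delta += B[o * rank + r] * h[r];
        }
        y[o] = base + scale * delta;
    }
}
```

When B is all zeros, as in the usual LoRA initialization, the layer's output is exactly W x, so training starts from the pre-trained model's behavior. A real extension would also need the backward pass for A and B and a way to skip gradient updates for W.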


qqqqqlqq 5 days ago

Does it work on ARM?


markusheimerl 5 days ago

I ran it once as a test on the NVIDIA Jetson Orin Nano Super Developer Kit, so yes, it works on ARM like a charm ;)
