| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nl 1066 days ago

Testing steps (based on thinking about this for 30 seconds - so probably can be improved):

Train a Transformer based model with and without the modified Softmax (Suggestions: GPT-2 or nanoGPT)

Measure performance - I'd probably start with Perplexity and see if there is any difference (we'd expect little difference).

Quantize both models with different quantization strategies.

Measure the perplexity of the quantized models of different sizes. We'd expect the performance to drop off quicker for the non-modified model than the modified one if this is working.

2 comments

renewiltord 1066 days ago

I was thinking about a different problem as I was typing that and got some mental memory alias bug. I wanted to know a set of steps to take to train a model. My apologies.

In any case, that was an lmgtfy-level question. Here's what I found: https://til.simonwillison.net/llms/training-nanogpt-on-my-bl...

I shall try that soon.

link

mcapodici 1066 days ago

Shaaaaameless plug:

I did a writeup like this. (Not as nicely as Simon though) where I modal.com (cloud GPU, containers, quick starts, free $30/m spend) to use their GPUs (e.g. T4, A100).

https://martincapodici.com/2023/07/15/no-local-gpu-no-proble...

T4 I think was good enough for the job, not much need for the A100.

Since this post I am working on an easy way to do this with a script called lob.py that requires no code changes to the nanoGPT repo (or whatever repo you are using) and runs in modal.com. The script exists but gets refined as I use it. Once it is battle tested a bit more I will do a post.

(It is named lob.py as it "lobs the code over to the server" where lob is UK slang for throw)

Watch this space.

link

renewiltord 1066 days ago

Thank you. FWIW I often find write-up + script superior to script because I often want to modify. e.g. I want to run GPU-only, but other script provide part-way solution when textual description added. Therefore, much appreciated.

link

nl 1066 days ago

In the Qualcomm AI paper linked in this post it turns out they use a similar testing approach:

BERT 109M, testing perplexity

OPT 125M, testing perplexity

ViT 22M, testing on ImageNet top-1.

link