by AMICABoard 699 days ago

My bad, I directly linked to the C file instead of the project here:

It is a program that, given a model file, a tokenizer file, and a prompt, continues the prompt by generating text.

To get it to work, you need to clone and build this: https://github.com/trholding/llama2.c
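The clone itself is a plain `git clone` of that URL. A minimal sketch that skips the clone when a checkout is already there; `fetch_repo` is a made-up helper name for illustration, not something the project ships:

```shell
# Hypothetical helper (not part of the project): clone a repo unless
# a checkout already exists at the target directory.
fetch_repo() {
  url=$1
  dir=$2
  if [ -d "$dir/.git" ]; then
    echo "already cloned: $dir"
  else
    git clone "$url" "$dir"
  fi
}

# Usage (needs network access):
#   fetch_repo https://github.com/trholding/llama2.c llama2.c
```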

So the steps are like this:

First you'll need to obtain approval from Meta to download Llama 3 models on Hugging Face.

Go to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct, fill out the form, and then check your acceptance status at https://huggingface.co/settings/gated-repos. Once accepted, run the following to download the model, export it, and run it.

huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
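Before exporting, it can help to confirm the download actually finished. A small sketch, assuming the `original/` folder holds a `params.json` plus one or more `consolidated.*.pth` checkpoint shards, which as far as I can tell is what the Meta-format loader in `export.py` looks for. `check_meta_dir` is a hypothetical helper name:

```shell
# Hypothetical helper: verify that a Meta-format checkpoint directory
# contains the files the export step is assumed to need.
check_meta_dir() {
  dir=$1
  if [ ! -f "$dir/params.json" ]; then
    echo "missing: $dir/params.json" >&2
    return 1
  fi
  # ls exits non-zero when the glob matches nothing.
  if ! ls "$dir"/consolidated.*.pth >/dev/null 2>&1; then
    echo "missing: $dir/consolidated.*.pth" >&2
    return 1
  fi
  echo "ok: $dir"
}

# Usage, from the directory where the download command was run:
#   check_meta_dir Meta-Llama-3.1-8B-Instruct/original
```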

cd llama2.c/

# Export Quantized 8bit

python3 export.py ../llama3.1_8b_instruct_q8.bin --version 2 --meta-llama ../Meta-Llama-3.1-8B-Instruct/original/

# Fastest Quantized Inference build

make runq_cc_openmp

# Test Llama 3.1 inference, it should generate sensible text

./run ../llama3.1_8b_instruct_q8.bin -z tokenizer_l3.bin -l 3 -i " My cat"
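If you want to script that last check, here is a crude sketch. It only verifies that the command exits cleanly and prints something to stdout; it cannot judge whether the text is sensible. `smoke_test` is a hypothetical helper name:

```shell
# Hypothetical helper: run a command and fail if it errors out or
# prints nothing to stdout. "$@" is the full command line, binary first.
smoke_test() {
  if ! out=$("$@"); then
    echo "command failed: $1" >&2
    return 1
  fi
  if [ -z "$out" ]; then
    echo "no output from: $1" >&2
    return 1
  fi
  printf '%s\n' "$out"
}

# Usage, with the model and tokenizer from the steps above:
#   smoke_test ./run ../llama3.1_8b_instruct_q8.bin -z tokenizer_l3.bin -l 3 -i " My cat"
```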