by quickthrower2 1145 days ago

I am quite new to this and would like to get it running. Would the process roughly be:

1. Get a machine with a decent GPU, probably by renting a cloud GPU.

2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...

3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.

4. Install EasyLM:

    conda env create -f scripts/gpu_environment.yml
    conda activate EasyLM

5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:

    python -m EasyLM.models.llama.llama_serve \
         --mesh_dim='1,1,-1' \
         --load_llama_config='13B' \
         --load_checkpoint='params::path/to/easylm/llama/checkpoint'

Am I even close?

2 comments

jbandela1 1145 days ago

I think llama.cpp might be easier to set up and get running.

https://github.com/ggerganov/llama.cpp

loudmax 1145 days ago

I second this recommendation to start with llama.cpp. It can run on a regular laptop and it gives a sense of what's possible.

If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer-grade gaming hardware.
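As a rough way to judge what fits on home hardware, here is a back-of-the-envelope sketch of the memory needed just to hold a model's weights. The parameter count (a nominal 7 billion) and the bit widths are illustrative assumptions, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter is stored at the same precision.
    Activations, KV cache, and runtime overhead are not counted.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    nominal_7b = 7e9  # assumed nominal parameter count for a "7B" model
    for bits in (16, 8, 4):  # illustrative precisions
        print(f"{bits:>2}-bit weights: {weight_memory_gib(nominal_7b, bits):.1f} GiB")
```

Under these assumptions, 16-bit weights for a 7B model need roughly 13 GiB, which is why lower-precision variants are what typically end up on laptops and gaming cards.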

The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat

The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui

I don't see any indication that OpenLLaMA will run on either of those without modification. But one of those, or some other framework, may emerge as a de facto standard for running these models.

JLCarveth 1145 days ago

Yes, I can clone this and get into a prompt in less than 5 minutes on an M2 MBA.

quickthrower2 1145 days ago

Might try it first. Seems to be CPU only?

azeirah 1145 days ago

It has partial GPU acceleration if you compile it with LLAMA_CUBLAS or LLAMA_CLBLAST.

They really have come a long way since... a few weeks ago.

Using cuBLAS with my 1080 Ti results in a 52% speedup compared to CPU-only. VRAM usage is very minimal.
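To put that number in terms of waiting time, here is a small sketch that converts a throughput speedup into the fraction of time saved. It assumes "52% speedup" means 1.52x the CPU-only throughput (tokens per second), which is one common reading and not something stated above.

```python
def time_saved_fraction(speedup_percent: float) -> float:
    """Fraction of wall-clock time saved for the same amount of work.

    Assumption: speedup_percent is a throughput gain, so 52 means the
    accelerated run does 1.52x as much work per second as the baseline.
    """
    factor = 1.0 + speedup_percent / 100.0
    return 1.0 - 1.0 / factor


if __name__ == "__main__":
    saved = time_saved_fraction(52)
    print(f"Time saved for the same output: {saved:.1%}")
```

Under that reading, a 52% throughput gain means the same generation finishes in about two thirds of the CPU-only time.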

themulticaster 1145 days ago

I'd see that as a benefit of llama.cpp - it's specifically designed to be usable on consumer hardware such as laptops, without professional GPUs.

thundergolfer 1145 days ago

You can get it running with one Python script on Modal.com :)

https://github.com/modal-labs/modal-examples/blob/main/06_gp...

quickthrower2 1145 days ago

Ok you lot! Will try out Modal.

quickthrower2 1145 days ago

Yeah, it is pretty nice. Not sure how long it took, but less than the time to make a sandwich (2 minutes). It cost 2-3c a pop, so sadly it is more expensive than GPT-3.5. However, maybe it can be optimised. Or maybe there is some init cost that could be stored in state.

    (modal) fme:/mnt/c/temp/modal$ modal run openllama.py
    Initialized. View app at https://modal.com/apps/ap-9...
    Created objects.
    +-- Created download_models.
    +-- Created mount /mnt/c/temp/modal/openllama.py
    +-- Created OpenLlamaModel.generate.
    +-- Created mount /mnt/c/temp/modal/openllama.py
    Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
    Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 1733.54it/s]
    Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  5.70s/it]
    Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.23s/it]
    Building a website can be done in 10 simple steps:
    1. Choose a domain name. 2. Choose a web hosting service. 3. Choose a web hosting package. 4. Choose a web hosting plan. 5. Choose a web hosting package. 6. Choose a web hosting plan. 7. Choose a web hosting package. 8. Choose a web hosting plan. 9. Choose a web hosting package. 10. Choose a web hosting plan. 11. Choose a web hosting package. 12. Choose a web hosting package. 13. Choose a web hosting package. 14. Choose a web hosting
    App completed.

thundergolfer 1144 days ago

Thanks for trying it out!

2-3c per run seems very high. That's probably just the cost if you have to spin up a new container. You can shorten the idle timeout on a container if it's typically going to serve just one request. If it's going to serve more requests, then the startup and idle shutdown cost is amortized over more requests :)
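A rough sketch of that amortization, assuming a simple model where a container is billed per second for its startup, its work, and its idle time before shutdown. Every number below (the price, the durations) is made up for illustration and is not Modal's actual pricing or timing.

```python
def cost_per_request(
    cold_start_s: float,
    idle_timeout_s: float,
    work_s: float,
    price_per_s: float,
    requests_per_container: int,
) -> float:
    """Average cost of one request under a simple per-second billing model.

    Assumption: startup and idle time are paid once per container and
    shared across all requests it serves; work time is paid per request.
    """
    overhead = (cold_start_s + idle_timeout_s) * price_per_s
    return overhead / requests_per_container + work_s * price_per_s


if __name__ == "__main__":
    # Hypothetical figures: 15 s startup, 10 s idle timeout, 5 s of work,
    # $0.001 per second.
    for n in (1, 10, 100):
        cost = cost_per_request(15, 10, 5, 0.001, n)
        print(f"{n:>3} request(s) per container: ${cost:.5f} each")
```

With these invented figures, a container that serves a single request costs about 3c per request, while one that serves 100 requests costs about half a cent each, because the fixed startup and idle cost is spread out.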

quickthrower2 1144 days ago

I found this was the cost per call to a web function. I used deploy to deploy it. The function just does what the main did in the example repo (earlier in this thread).
