by sumo43
1000 days ago
Cool service. It's worth noting that with quantization/QLoRA, models as big as llama2-70b can run on consumer hardware (2x RTX 3090) at acceptable speeds (~20 t/s) using frameworks like llama.cpp. This avoids the significant latency that comes from parallelizing across different servers. P.S. From my experience instruct-finetuning falcon180b, it's not worth using over llama2-70b, as it's significantly undertrained.
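For a rough sense of why 2x RTX 3090 is enough, here is a back-of-the-envelope sketch of the weight memory alone. It assumes a nominal 70e9 parameters, 24 GB per card, and exactly 4 bits per weight. Real quantization formats also store scaling metadata, so the effective bits per weight are somewhat higher, and the KV cache and activations need room on top of this. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions: nominal parameter count, two 24 GB cards.
llama2_70b_params = 70e9
total_vram_gb = 2 * 24

fp16_gb = weight_memory_gb(llama2_70b_params, 16)  # unquantized half precision
q4_gb = weight_memory_gb(llama2_70b_params, 4)     # idealized 4-bit quantization

print(f"fp16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {q4_gb:.0f} GB")
print(f"available:     {total_vram_gb} GB")
print("4-bit fits:", q4_gb < total_vram_gb)
```

Under these assumptions the fp16 weights are far too large for the two cards, while the 4-bit weights leave some headroom for the cache.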
|
We developed Petals for people who have less GPU memory than a model needs. Also, there's still a chance that larger open models will be released in the future.