I ran research/open_llama_7b_preview_200bt on there, using their Python example, with an A10G GPU.
It cost 2-3 cents per run, taking ~20 seconds each time, on fairly small prompts. So about the same as GPT-4?
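The sketch below is a rough back-of-envelope check on the "about the same as GPT-4?" comparison. It uses only the figures above (2-3 cents and ~20 seconds per run). The GPT-4 price constant is an assumption (the launch price for GPT-4 8K-context completion tokens, which may have changed), and the helper names are hypothetical.

```python
# Back-of-envelope cost arithmetic for the runs described above.
# Inputs are the rough figures from this post: 2-3 cents and ~20 s per run.

# ASSUMPTION: GPT-4 (8K) price in USD per 1K completion tokens at launch.
# Check current pricing before relying on this number.
GPT4_USD_PER_1K_COMPLETION_TOKENS = 0.06


def implied_rates(cost_usd: float, seconds: float) -> tuple[float, float]:
    """Return the implied (USD per second, USD per hour) of GPU time for one run."""
    per_second = cost_usd / seconds
    return per_second, per_second * 3600


def gpt4_equivalent_tokens(cost_usd: float) -> float:
    """Hypothetical helper: GPT-4 completion tokens the same money would buy."""
    return cost_usd / GPT4_USD_PER_1K_COMPLETION_TOKENS * 1000


if __name__ == "__main__":
    for cost in (0.02, 0.03):
        per_s, per_h = implied_rates(cost, 20.0)
        tokens = gpt4_equivalent_tokens(cost)
        print(f"${cost:.2f}/run: ${per_s:.4f}/s, ${per_h:.2f}/h, "
              f"~{tokens:.0f} GPT-4 completion tokens")
```

At those figures the GPU time works out to roughly $3.60-$5.40 per hour, and the same 2-3 cents would buy on the order of 330-500 GPT-4 completion tokens at the assumed rate. Because the self-hosted cost scales with seconds of GPU time rather than with tokens, anything that makes a run faster lowers the per-run cost directly.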
Now, this is a non-expert just playing around; it could probably be optimized by trying different GPUs and optimizing the code somehow.
I don't think you'd use these models to save money, but you might use them for tunability, privacy, mobility [1], secrecy, or fun/research.
[1] In other words, you want to build a robot that can work disconnected from the internet.