Hacker News
by supermatt 1212 days ago
> We considered trying to use a self-hosted LLM as an alternative, but the costs would also have been extremely high for the amount of traffic we were processing.

Is it realistic to self-host an LLM that outperforms OpenAI's offerings cost-wise? When I looked at the alternatives (self-hosted, alternative hosted LLM providers, or cloud compute options), you generally ended up with a subjectively worse model AND lower inference speed, which resulted in me canning my idea as it was simply too expensive.
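To make the cost question concrete, here is a rough back-of-the-envelope sketch. Every number in it (request volume, tokens per request, API price, GPU hourly rate, throughput) is a hypothetical placeholder rather than a real price or benchmark, and the self-hosted figure optimistically assumes you only pay for GPU time while it is busy.

```python
def api_cost_per_month(requests_per_month, tokens_per_request, usd_per_1k_tokens):
    """Monthly cost of a pay-per-token hosted API."""
    return requests_per_month * tokens_per_request / 1000 * usd_per_1k_tokens


def self_hosted_cost_per_month(requests_per_month, tokens_per_request,
                               tokens_per_second, gpu_usd_per_hour):
    """Monthly GPU cost, assuming the GPU is billed only while generating."""
    gpu_seconds = requests_per_month * tokens_per_request / tokens_per_second
    return gpu_seconds / 3600 * gpu_usd_per_hour


# Hypothetical inputs: replace with your own traffic, prices, and measured throughput.
requests = 1_000_000
tokens = 500
api = api_cost_per_month(requests, tokens, usd_per_1k_tokens=0.02)
hosted = self_hosted_cost_per_month(
    requests, tokens, tokens_per_second=50, gpu_usd_per_hour=2.0
)
print(f"API: ${api:,.2f}/month, self-hosted: ${hosted:,.2f}/month")
```

The comparison is very sensitive to throughput and utilization: an idle GPU that is still billed by the hour, or a slower model, shifts the result quickly.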


Flan-T5 is much smaller than GPT-3, but was trained on significantly more data, resulting in competitive accuracy. It is also Apache licensed. I wonder if that model is fast enough for enough use cases to make it cost-effective?
You can give the XL (3B parameters) model a try here (would recommend a Colab Pro account): https://colab.research.google.com/drive/1Hl0xxODGWNJgcbvSDsD...

In my Colab Pro it's running this on an A100 (which is a very beefy GPU), and inference is very fast and definitely suitable for interactive use. On a T4 GPU (which is much cheaper), inference is still alright and probably OK for interactive use.
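If you want to check latency on your own hardware, a minimal sketch along these lines should work, assuming `torch` and Hugging Face `transformers` are installed and that the `google/flan-t5-xl` checkpoint is the one you want. The helper names (`timed`, `load_flan_t5`, `generate`) are made up for illustration, and this is not the code from the linked notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def load_flan_t5(model_name="google/flan-t5-xl"):
    """Load tokenizer and model; downloads the weights on first use."""
    # Imported lazily so the timing helper works without these packages.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    return tokenizer, model, device


def generate(tokenizer, model, device, prompt, max_new_tokens=64):
    """Generate a completion for a single prompt."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (left commented out because it downloads several GB of weights):
# tokenizer, model, device = load_flan_t5()
# text, seconds = timed(generate, tokenizer, model, device,
#                       "Answer the question: what is the capital of France?")
# print(f"{text!r} in {seconds:.2f}s on {device}")
```

Timing a handful of prompts at your real input lengths gives a better picture than a single call, since the first call also includes warm-up overhead.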

I think Flan-T5 is fast enough, but I don't think it generates text or handles abstract reasoning at nearly the same level as current GPT-3 models. This indicates a deficiency in the benchmarks and metrics that we use to evaluate LLMs. For generating embeddings it might work well enough, though.
It's certainly not quite as good out of the box, at least the open-sourced checkpoints. However, so far I have found it can achieve similar accuracy with enough examples and/or fine-tuning for my use cases. Like everything, it depends on what you are doing too.
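As a rough illustration of the "enough examples" route, here is a sketch of a few-shot prompt builder. The `Input:` / `Output:` layout and the function name are assumptions for illustration, not a format Flan-T5 requires; the resulting string would be passed to whatever model you are testing.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and the new query into one prompt."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {label}")
        parts.append("")
    # Leave the final output empty so the model fills it in.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Terrible service", "negative")],
    "The food was great",
)
print(prompt)
```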
For embeddings, it may be overkill. Smaller BERT-type models can provide good embeddings when fine-tuned with a contrastive learning objective. E.g.: https://sbert.net.
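For anyone unfamiliar with what a contrastive objective looks like, here is a minimal pure-Python sketch of one common variant: an in-batch softmax loss over cosine similarities, where each anchor's paired text is the positive and the other texts in the batch act as negatives. The toy 2-D vectors and the temperature value are assumptions for illustration; real training would use a deep learning framework, and this is not necessarily the exact loss any particular sbert.net model was trained with.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy embeddings: "aligned" pairs each anchor with a nearby vector, "swapped" does not.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is closest to its own positive and large when it is closer to someone else's, which is what pushes matching texts together in embedding space during fine-tuning.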
Fine-tuning smaller models like GPT-J (also trained on The Pile) worked well for Toolformer:

https://arxiv.org/abs/2302.04761