Hi HN community, I have been working on benchmarking publicly available LLMs for the past couple of weeks. More precisely, I am interested in the fine-tuning piece, since a lot of businesses are starting to entertain the idea of self-hosting LLMs trained on their proprietary data rather than relying on third-party APIs. To that end, I am tracking the following 4 pillars of evaluation that businesses typically look into:
- Performance
- Time to train an LLM
- Cost to train an LLM
- Inference (throughput / latency / cost per token)

For each LLM, my aim is to benchmark it on popular tasks, i.e., classification and summarization, and to compare the models against each other. So far, I have benchmarked Flan-T5-Large, Falcon-7B and RedPajama and have found them to be very efficient in low-data situations, i.e., when there are very few annotated samples. Llama2-7B/13B and Writer’s Palmyra are in the pipeline. But there are so many LLMs out there! If this work interests you, it would be great to join forces. GitHub repo attached; feedback is always welcome :) Happy hacking!
On a separate note, I have received a few questions about the value-add of this repository. Here is my take and my vision for this repository:
Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents, no one had done a good job of stress-testing them on sample complexity, i.e., how performance changes as the number of annotated training samples shrinks.
We all know that LLMs have the power of generalisability, but how do they actually compare to the likes of BERT and DistilBERT, which have become household names in the world of NLP? Can these LLMs compete with them on tasks beyond chat, like classification, named entity recognition, etc.?
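To make "stress-testing on sample complexity" a bit more concrete, here is a minimal sketch of how one could carve a labeled dataset into progressively smaller training subsets while keeping every class represented. This is not the code from the repo: the function name `stratified_subsample`, the toy data and the chosen fractions are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, fraction, seed=0):
    """Return roughly `fraction` of `examples`, keeping at least one item per label.

    `examples` is a list of (text, label) pairs. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so every model sees the same subset

    # Group examples by label so each class is sampled separately.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    subset = []
    for label in sorted(by_label, key=str):
        items = by_label[label]
        k = max(1, round(len(items) * fraction))
        subset.extend(rng.sample(items, k))

    rng.shuffle(subset)
    return subset


# Toy dataset: 100 documents, two balanced classes (made-up data).
toy = [(f"doc {i}", i % 2) for i in range(100)]

# Assumed fractions; a real study would pick these to match its annotation budget.
fractions = [0.1, 0.2, 0.5, 1.0]
subsets = {f: stratified_subsample(toy, f) for f in fractions}

for f, subset in subsets.items():
    print(f"fraction={f}: {len(subset)} training examples")
```

The idea would be to fine-tune each model (an LLM, BERT, DistilBERT, ...) once per subset and plot the task metric against the subset size, so the comparison is made at equal annotation budgets.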
If you go over to a model folder, let’s say Flan or Falcon, you will notice that the README has rich documentation of our research findings. This, I guarantee you, you won’t find anywhere else. Additionally, the inference section has a good study of how these models fare as the number of requests goes up, and of the associated costs.
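For anyone wondering how the inference numbers (throughput / latency / cost per token) relate to each other, here is a rough sketch of the arithmetic. It is not taken from the repo: the function `summarize_load_test`, the sample latencies and the $1.20/hour instance price are made-up assumptions, and it assumes a single dedicated instance billed for the wall-clock duration of the test.

```python
import statistics


def summarize_load_test(latencies_s, tokens_out, wall_time_s, price_per_hour_usd):
    """Summarize one load-test run (hypothetical helper, for illustration only).

    latencies_s        -- per-request latency in seconds (needs at least 2 values)
    tokens_out         -- tokens generated per request
    wall_time_s        -- wall-clock duration of the whole run in seconds
    price_per_hour_usd -- assumed hourly price of the serving instance
    """
    total_tokens = sum(tokens_out)

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")

    # Cost of keeping the instance up for the duration of the run.
    run_cost_usd = price_per_hour_usd * wall_time_s / 3600

    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "tokens_per_s": total_tokens / wall_time_s,
        "p50_latency_s": statistics.median(latencies_s),
        "p95_latency_s": cuts[94],
        "usd_per_1k_tokens": 1000 * run_cost_usd / total_tokens,
    }


# Made-up measurements: 5 requests, 100 generated tokens each, over a 10 s run.
summary = summarize_load_test(
    latencies_s=[0.8, 1.0, 1.2, 1.5, 2.0],
    tokens_out=[100] * 5,
    wall_time_s=10.0,
    price_per_hour_usd=1.20,  # assumed price, not a real quote
)

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Repeating this at increasing request rates shows where tail latency starts to climb and how cost per token moves with utilization.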
I will end by saying that a lot of people and repositories are just riding the wave of the buzz surrounding LLMs without answering a lot of the questions that data scientists and ML engineers actually have. And those questions (the 4 pillars of the evaluation framework) need to be answered for enterprises to build software, not just slap together a chat interface / UI on top of the latest LLM and then call it a revolutionary product.