Hacker News
Show HN: finetune LLMs via the Finetuning Hub (github.com)
80 points by rsaha7 1020 days ago
Hi HN community, I have been working on benchmarking publicly available LLMs these past couple of weeks. More precisely, I am interested in the finetuning piece, since a lot of businesses are starting to entertain the idea of self-hosting LLMs trained on their proprietary data rather than relying on third-party APIs.

To this end, I am tracking the following 4 pillars of evaluation that businesses typically look into:
- Performance
- Time to train an LLM
- Cost to train an LLM
- Inference (throughput / latency / cost per token)
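As a rough illustration of the inference pillar, here is a minimal sketch of how throughput, tail latency, and cost per token could be derived from one load test. The `InferenceRun` fields and the `summarize` helper are hypothetical names invented for this example, not part of the repository, and any numbers you feed in are your own measurements.

```python
from dataclasses import dataclass


@dataclass
class InferenceRun:
    """Measurements from one load test (hypothetical structure)."""
    num_requests: int
    total_tokens: int               # tokens generated across all requests
    wall_clock_seconds: float       # duration of the whole test
    latencies_seconds: list         # one latency per request
    instance_cost_per_hour: float   # price of the serving instance, in USD


def summarize(run: InferenceRun) -> dict:
    """Derive throughput, p95 latency, and cost per token from raw measurements."""
    if not run.latencies_seconds:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(run.latencies_seconds)
    # Nearest-rank 95th percentile, using integer arithmetic for the ceiling.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[max(rank, 1) - 1]
    # Cost of keeping the instance up for the duration of the test.
    run_cost = run.instance_cost_per_hour * run.wall_clock_seconds / 3600
    return {
        "requests_per_second": run.num_requests / run.wall_clock_seconds,
        "tokens_per_second": run.total_tokens / run.wall_clock_seconds,
        "p95_latency_seconds": p95,
        "cost_per_token_usd": run_cost / run.total_tokens,
    }
```

Reporting a tail percentile rather than the mean latency is a design choice: it shows how the slowest requests behave as load goes up.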

For each LLM, my aim is to benchmark them for popular tasks, i.e., classification and summarization. Moreover, I would like to compare them against each other.

So far, I have benchmarked Flan-T5-Large, Falcon-7B and RedPajama and have found them to be very efficient in low-data situations, i.e., when there are very few annotated samples. Llama2-7B/13B and Writer’s Palmyra are in the pipeline.

But there are so many LLMs out there! In case this work interests you, it would be great to join forces.

GitHub repo attached — feedback is always welcome :)

Happy hacking!

5 comments

I have received a lot of great feedback. We are moving fast to add instructions on how to load your custom dataset and how to choose prompts, to give researchers a finer level of control.

On a separate note, I have received a few questions about the value-add of this repository. Here is my take and my vision for this repository:

Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents — no one did a good job of stress-testing them on sample complexity.

We all know that LLMs have the power of generalisability, but how do they actually compare to the likes of BERT and DistilBERT, which have become household names in the world of NLP? Can these LLMs compare with them on tasks beyond chat, like classification, named entity recognition, etc.?

If you go over to a model folder, let’s say Flan or Falcon, you will notice that the README has rich documentation of our research findings. This, I guarantee you, you won’t find anywhere else. Additionally, the inference section has a good study of how these models fare when the number of requests goes up, and the associated costs.

I will end by saying that a lot of people and repositories are just riding the wave of the buzz surrounding LLMs without answering a lot of the questions that data scientists and ML engineers actually have. And those questions (the 4 pillars of the evaluation framework) are necessary to answer for enterprises to build software, not just slap together a chat interface / UI on top of the latest LLM and then call it a revolutionary product.

I don't see how the loading works for the end user's custom dataset. In fact, I find the layers of abstraction you have between getting the finetuning dataset and the actual training very opaque. I can't even tell where the dataset is coming from, it doesn't appear to be an example local to this repository.

I think a lot of people want something like... "drop .txt files of example data to train on in this /folder/ and run python finetune.py /folder/"
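A minimal sketch of the "drop .txt files in a folder" workflow described above, using only the standard library. The `load_text_folder` helper and the `source`/`text` record layout are assumptions made up for illustration; the repository does not necessarily load data this way.

```python
from pathlib import Path


def load_text_folder(folder, min_chars=1):
    """Read every .txt file in `folder` into a list of training records.

    Hypothetical helper: files are read in sorted order, and files with
    fewer than `min_chars` characters after stripping are skipped.
    """
    examples = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if len(text) >= min_chars:
            examples.append({"source": path.name, "text": text})
    return examples
```

A script in the spirit of the hypothetical `finetune.py /folder/` could call this once at startup and hand the records to whichever training loop it uses.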

You are right that the loading currently goes through Hugging Face’s datasets library. The feedback about it being opaque has merit, and we are working on giving users more control and visibility into the dataset loading. To your point, adding instructions about how to load your own dataset and do fine-tuning would help researchers make better use of these models. That being said, the README under each model folder has all the info one needs to get started.

More than happy to have you contribute to the repo. There’s a lot of exciting work to be done.

It's hardcoded. Hidden within the code there are calls to Hugging Face's datasets.load_dataset, but you don't get to specify the dataset on your own.
Great observation. We are working on making this part very explicit. The goal was to let researchers get up to speed with the codebase to begin with, and then they would understand what needs to change to make these models work on custom datasets.

That being said, we are working on adding instructions to specify dataset and also the prompt that users want to use.

This is actually what I was hoping for: a web UI where you can load a model, then load some data and hit train.

You can do this in the Stable Diffusion UI to fine-tune models with your own dataset.

Feedback taken. We are working on letting users explicitly specify the task and dataset they want to train models on. Additionally, we will introduce a flag to let people specify the prompt they want to use for finetuning these models.
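To make the idea of task, dataset, and prompt flags concrete, here is a minimal sketch of what such a command-line interface could look like. Every flag name, the default template, and the helper functions are hypothetical; the repository's actual interface may differ.

```python
import argparse

# Hypothetical default template; "{text}" is replaced by one input example.
DEFAULT_PROMPT = "Classify the following text.\n\nText: {text}\n\nLabel:"


def build_parser():
    """Build an argument parser with illustrative (made-up) flags."""
    parser = argparse.ArgumentParser(description="Hypothetical finetuning entry point")
    parser.add_argument("--task", choices=["classification", "summarization"], required=True)
    parser.add_argument("--dataset", required=True, help="dataset name or local path")
    parser.add_argument("--prompt-template", default=DEFAULT_PROMPT,
                        help="template containing a {text} placeholder")
    return parser


def render_prompt(template, text):
    """Fill one example into the user-supplied prompt template."""
    return template.format(text=text)
```

Passing the dataset and the prompt as arguments keeps them visible at the call site instead of buried in the training code, which is the concern raised earlier in the thread.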
OobaBooga supports this kind of load-and-go LoRA: https://github.com/oobabooga/text-generation-webui
Looked at the project. Great initiative.
Hey! I don’t understand enough about LLMs. Fine-tuning seems like something great, but I feel locked out of it. Do I need to prepare data in a question-answer format? I have started to play with taking things like text, articles, and tweets and converting them to questions, but I don’t think I’m following best practices. Can you help explain how to take different data sources maybe like a list of documentation for an open source project and fine tune using it?
Great feedback! We are working on adding instructions on loading custom datasets for your own needs, what the format of the prompt should be, etc.

Next release will have these features.

To intentionally oversimplify, fine-tuning an LLM on your data is a completely nonsensical concept for 99% of the world.

People have the impression that training an LLM on your data will result in an LLM that can answer questions on your data. But for any realistic dataset and training that a non-FAANG can do, that's not true.

> Can you help explain how to take different data sources maybe like a list of documentation for an open source project and fine tune using it?

You would not do this.

Let's say you're writing code with an open source library. There's a new animation API that didn't exist when the LLM was trained:

1. You ask your coding chatbot: How do I make this box move right to left across my screen?

2. Before the chatbot UI submits the question to the LLM, it manually searches for text related to your question in the library's documentation using BM25F and BERT

3. You give the LLM the results of your search and the user's question, at the same time.

The LLM now has a snippet of up-to-date documentation, and can look at that to produce novel code that animates the box based on the documentation.

Depending on latency requirements you can have a "Step 2.5", where you ask the LLM "What searches would you do if I gave you the docs for this library and you needed to answer <insert question>".
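The steps above can be sketched in a few lines. This is a simplified illustration, not the commenter's implementation: it uses plain Okapi BM25 rather than BM25F, skips the BERT stage entirely, and the documentation snippets (including the `animate.translate` call) are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with plain Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = (sum(len(t) for t in tokenized) / len(tokenized)) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(question, docs, top_k=1):
    """Step 3: put the best-matching snippets and the question in one prompt."""
    scores = bm25_scores(question, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    context = "\n".join(docs[i] for i in ranked[:top_k])
    return f"Documentation:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The string returned by `build_prompt` is what would be sent to the LLM in step 3; a "Step 2.5" variant would first ask the LLM for search queries and run each one through `bm25_scores`.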

_

Here BERT is being used to find snippets of text that are more likely to help us answer a question

For example, this model: https://huggingface.co/thenlper/gte-large

When given the query: Reference documentation for 'Write some code to move this square across my screen'

It ranks some imaginary documentation in the following order:

1. "Object translation has been reworked in the new animation API"
2. "Layout components include the Box, Grid, and Stack"
3. "Move your hosting to AWS with our cloud build API"

The BERT model "understands" that while we used the words moving and square:
- "Move your hosting" is a semantically different concept
- "Box" is similar to "Square", but "Box" is not central to the request.

Now we can give the LLM the most relevant snippet, and it uses that as guidance for its own reply (also known as Retrieval Augmented Generation)
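The ranking step can be sketched as a cosine-similarity sort over embedding vectors. In practice the vectors would come from an embedding model such as the one linked above; the three-dimensional vectors used in the usage note are hand-written placeholders, not real model output, and the function names are made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document keys ordered from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
```

For example, with the placeholder query vector `[0.9, 0.3, 0.0]` and placeholder document vectors `{"translation": [0.8, 0.4, 0.1], "layout": [0.3, 0.9, 0.1], "hosting": [0.0, 0.1, 0.9]}`, `rank_by_similarity` puts "translation" first and "hosting" last.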

_

There is a place where fine-tuning can actually be applied in this process, but it is not fine-tuning a (chat) LLM. It is completely unrelated to most mentions of fine-tuning you've heard in the last X months.

You can fine-tune BERT (a much smaller model) to get better at finding relevant snippets of your documentation. You can do this without labeled data.

*Literally give it a bag of sentences and let it go to work*: https://www.sbert.net/examples/unsupervised_learning/TSDAE/R...

TSDAE doesn't really perform that well on wide domains, but it works well where the documents contain both information and what that information is for (think code documentation with examples vs. Wikipedia, which is just raw information). It also only takes 1k sentences to start; you could find a bunch of random documentation sites on GitHub and feed them in.
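TSDAE is a denoising auto-encoder objective: sentences are corrupted by deleting tokens, and the model is trained to reconstruct the original from the corrupted version, which is why no labels are needed. Below is a minimal sketch of the data-preparation step only, with no model or training loop. The helper names are made up, and the 0.6 deletion ratio is an assumption based on the commonly cited setting.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Corrupt a sentence by randomly dropping whitespace-separated tokens."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    kept = [t for t in tokens if rng.random() > del_ratio]
    if not kept:
        # Always keep at least one token so the input is never empty.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def make_denoising_pairs(sentences, del_ratio=0.6, seed=0):
    """Build (noisy, original) pairs from a plain bag of unlabeled sentences."""
    rng = random.Random(seed)
    return [(delete_tokens(s, del_ratio, rng), s) for s in sentences]
```

Each pair gives the encoder a damaged input and the decoder the full sentence as its target, so a list of documentation sentences is all the training data this step needs.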

I don’t think you fully understand the scope of this project. Your thinking and arguments are limited by your understanding of what all is possible with these models.

This repository argues that LLMs can be used for more applications beyond just chat and QnA. Based on our experimental findings (which you would have found if you had the time to go through the README under any model folder), you can see LLMs do classification tasks really well in low-data situations. For the 99% of startups that don’t have the luxury of holding thousands of annotated samples like FAANG, LLMs provide a good alternative to get started with few annotated samples. At the end of the day, these models are based on the attention-based transformer architecture.

I would be curious to see some quantitative backing of your statements and not just links to huggingface’s website & conjectures.

And btw, the entire ecosystem is trying to answer a lot of these questions because it is still too early to predict anything. And here you are claiming they are absolutely nonsensical for 99% of companies.

Btw did you know that a lot of companies cannot use third-party APIs because of sensitive customer data? For them, having self-hosted models is a good alternative to have. And with the likes of Llama2 and Falcon closing the performance gap, the idea of self-hosted models for tasks beyond chat does not seem far-fetched.

This is interesting and I want to investigate using it for training templates.

I'm working on a couple of projects, one of which starts and manages GPU-backed instances on Google Cloud: https://github.com/FeatureBaseDB/Laminoid

This is a very common use case, and other users have mentioned this as well. We have taken this feedback, and will move fast to add instructions on how to leverage these models on custom datasets and custom prompts. Stay tuned!
Thanks for putting this project together. How does your project differ from similar ones? (looked at the main repo)
Thanks a ton! And that’s a great question!

Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents — no one did a good job of stress-testing them on sample complexity.

We all know that LLMs have the power of generalisability, but how do they actually compare to the likes of BERT and DistilBERT, which have become household names in the world of NLP? Can these LLMs compare with them on tasks beyond chat, like classification, named entity recognition, etc.?

If you go over to a model folder, let’s say Flan or Falcon, you will notice that the README has rich documentation of our research findings. This, I guarantee you, you won’t find anywhere else. Additionally, the inference section has a good study of how these models fare when the number of requests goes up, and the associated costs.

I will end by saying that a lot of people and repositories are just riding the wave of the buzz surrounding LLMs without answering a lot of the questions that data scientists and ML engineers actually have. And those questions (the 4 pillars of the evaluation framework) are necessary to answer for enterprises to build software, not just slap together a chat interface and then call it a revolutionary product.