Hacker News
Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing (github.com)
14 points by rsaha7 804 days ago
Hello all!

Very happy to share this toolkit that allows you to fine-tune your choice of open-source LLMs on your data! The toolkit also allows you to run ablation studies across LLMs, prompt designs, training configurations, and can ingest different data files -- all through just ONE YAML file! After fine-tuning, you can also run a bunch of tests to ensure that the fine-tuned LLM behaves as expected, enabling faster time-to-production!

Why this toolkit? Why now?

While closed-source LLMs have become popular for chat-based applications, enterprises are considering a shift to self-hosted SLMs (smaller language models) since there is evidence that you don't need a gigantic model to solve narrow edge-cases. Plus, enterprises want to own the data pipeline from start to end, i.e., data ingestion, training, deployment, feedback collection and testing! Their customers' valuable data stays within their ecosystem, allowing enterprises to not worry about compliance or data leakage issues that come up using third-party APIs.

While there are a few repositories out there that do vanilla fine-tuning, it is well known that it takes more than one run to find the right settings of weights / parameters for your specific data. Bearing this pain-point in mind, we designed the toolkit to allow running multiple experiments through one config file!

Around 5 months ago, I shared a repository that contained individual fine-tuning scripts for the most popular LLMs. While the repository was well received by this community, one piece of feedback was unanimous -- the community wants to build on top of our scripts! This prompted us to design the toolkit, bearing in mind the pain-points that data scientists / researchers / engineers like myself face!

Please feel free to give it a try! Looking forward to your feedback!

12 comments

This is a great project, a little bit similar to https://github.com/ludwig-ai/ludwig, but it includes testing capabilities and ablation.

Questions regarding the LLM testing aspect: How extensive is the test coverage for LLM use cases, and what is the current state of this project area? Do you offer any guarantees, or is it considered an open-ended problem?

Would love to see more progress in this direction!

Thanks for the feedback! Yes, it is similar to Ludwig, but we do think that our toolkit is a more lightweight solution for fine-tuning and ablation studies. In most cases, finding the right LLM with the right config for your dataset requires multiple runs (grid search). Our toolkit offers this capability via one YAML file.

As for the test coverage, right now, the toolkit includes property-based unit tests. For instance, for an LLM fine-tuned on summarization, a property test checks that the summarized text is shorter than the actual input text.

Similar to the above test, we have a handful of property-based tests. Of course, the list is not exhaustive at this time. As more progress is being made on the testing side, we aim to distill the most relevant tests depending on use-cases.
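
To make the length property concrete, here is a minimal sketch in Python. It is not the toolkit's actual test code: the names `summary_is_shorter`, `fake_summarize` and `failing_inputs` are hypothetical, the "model" is a stand-in that just keeps the first few words, and measuring length in words (rather than characters or tokens) is an assumption.

```python
def summary_is_shorter(source_text: str, summary: str) -> bool:
    """Property: the summary has strictly fewer words than the source.

    Counting words is an assumption; characters or tokens would also work.
    """
    return len(summary.split()) < len(source_text.split())


def fake_summarize(text: str, max_words: int = 5) -> str:
    """Hypothetical stand-in for a fine-tuned model: keep the first few words."""
    return " ".join(text.split()[:max_words])


def failing_inputs(inputs, summarize=fake_summarize):
    """Return every input whose summary violates the length property."""
    return [t for t in inputs if not summary_is_shorter(t, summarize(t))]


if __name__ == "__main__":
    samples = [
        "the quick brown fox jumps over the lazy dog",  # 9 words, summary has 5
        "one two three",                                # 3 words, summary has 3
    ]
    print(failing_inputs(samples))
```

The second sample fails because the stand-in returns the input unchanged, which is the kind of regression a property test like this is meant to catch.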

Hope this helps.

Sounds like a great library to use for automatically testing all the new models being released every day and finding out if a new open-source model performs significantly better on your custom dataset.

1. What's the largest model (number of parameters) that you've tested the library with?

2. Will MoE models work as well? They're known to have more unstable training and need some custom techniques to stabilize

Thanks for the feedback!

1. The largest model that we have tested is Llama2 13B. For the first phase, we focussed on fine-tuning LLMs in the 1B-13B range. For our next phase, we will focus on 13B-45B'ish -- for this we will have to incorporate distributed techniques.

2. Following incorporation of distributed training techniques, we will be able to run MoE based models, such as Mixtral.

This tool looks super interesting! I had a question- You mentioned the ability to run tests to ensure the fine-tuned model behaves as expected. What evaluation metrics are built in, and how customizable is the evaluation pipeline? Can I easily add my own metrics?
Also worth noting that the toolkit comes with 3 settings:

1. Basic - Set up your first simple fine-tuning experiment
2. Intermediate - Create custom config files for specialized fine-tuning experiments
3. Advanced - Run ablation studies through the same config file by defining various settings!

Just tried this out and got the default working in a few minutes! Would love to see support for the fine-tuning dataset format used by OpenAI, and handling for history.
Thanks for the feedback! Glad you got the default setting working quickly!

Right now, we are focussed mostly on offering support for open-source models but we can definitely extend support for OpenAI formats.

May I ask what you mean by history?

This is a great project! But I was wondering how frequently the library will be updated with the new optimization techniques that keep coming out?
Thanks for the feedback! The goal is to offer new techniques via our toolkit as soon as they become available on HuggingFace. To that end, we are aiming to move fast and bring those techniques to the toolkit as soon as possible after their release.
What's the roadmap for this library? It seems like there are already a couple of packages that do similar things -- what's the main differentiator for this?
Great question.

Right now, the roadmap includes extending the training optimizer sections to include techniques beyond LoRA.

Furthermore, the testing suite will be extended to add more unit-tests that are task dependent.

I know that other repositories exist with similar functionalities but they can be too low level for the day-to-day data scientist to understand. Also, there are several repositories that are too specific for either testing, fine-tuning, etc. Our repository consolidates the most critical aspects of running fine-tuning experiments while being lightweight for anyone to understand and play with.

Is there support for a UI? I know there are many repositories supporting UI functionalities that make it easier to experiment with different LLMs.
The toolkit does not support a UI at this time.

We focussed on simplifying the experimentation experience that a data scientist / engineer typically goes through.

For instance, if you want to find the best LLM with the best configuration for your dataset, then ideally, you would run an ablation study (think grid search over learning rate, number of epochs, etc.). It would be challenging to show this progress in a UI.

The ideal user of the toolkit would set all the experimentation details in a config file, and then run it via the terminal -- come back to it after a day or so, depending on how big the search space is.
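
As a rough illustration of why the search space grows quickly, here is a minimal Python sketch of expanding a grid into individual runs. This is not the toolkit's config schema or code: `expand_grid`, the keys in `search_space`, and the model labels are all hypothetical.

```python
from itertools import product


def expand_grid(search_space: dict) -> list[dict]:
    """Expand {name: [candidate values]} into one config dict per combination."""
    keys = list(search_space)
    value_lists = [search_space[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Hypothetical search space: every key and value here is illustrative only.
search_space = {
    "model": ["llama2-7b", "mistral-7b"],
    "learning_rate": [1e-4, 2e-4],
    "num_epochs": [1, 3],
}

if __name__ == "__main__":
    runs = expand_grid(search_space)
    print(len(runs))  # 2 * 2 * 2 = 8 runs
    for run in runs:
        print(run)
```

Each added option multiplies the number of runs, which is why a search like this is something you start from the terminal and come back to later.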

I have been meaning to explore fine-tuning LLMs on my own dataset. Which formats does this toolkit support?
You can fine-tune on your own dataset! As long as your dataset is in one of json, csv or huggingface formats, our toolkit can ingest your data!
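
For readers wondering what "ingesting" JSON or CSV amounts to, here is a minimal, dependency-free Python sketch that reads either file type into a list of records. It is an assumption about the general shape of such a loader, not the toolkit's implementation: `load_records` is a hypothetical name, and HuggingFace datasets are left out because they would go through the separate `datasets` library.

```python
import csv
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load a .json (array of objects) or .csv file into a list of dicts."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".json":
        with p.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of records")
        return data
    if suffix == ".csv":
        # DictReader uses the header row as keys; all values come back as strings.
        with p.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported file type: {suffix!r}")
```

Either way you end up with rows keyed by column name, which is what a prompt template would then fill in.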
Is the fine-tuning more like "instruct fine-tuning"? I would like to try it out for a sample use case I have, but before proceeding it would be better to know the fundamentals behind it.
Cool project. I hope it expands to other forms of tuning LLMs
Thanks for the feedback!

The goal is to extend the training optimization techniques beyond LoRA / QLoRA :)

Happy to have you join our team!

Which LLMs are supported in this toolkit?
The toolkit supports open-source LLMs that are available on HuggingFace.

So, that would include Llama2, Falcon, Mistral and the like.