by higgsfield 946 days ago
Hey there! We're Higgsfield AI.

We have a massive GPU cluster and have developed our own infrastructure to manage it and train large models.

Here's how it works:

- Upload your dataset in the preconfigured format to HuggingFace [1].

- Choose your LLM (e.g. LLaMa 70B, Mistral 7B).

- Place your submission in the queue.

- Wait for it to get trained.

- Get your trained model back on HuggingFace.
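
The steps above can be sketched as a simple first-in, first-out queue. This is only an illustration under stated assumptions, not Higgsfield's actual implementation: the `Submission` fields, the `TrainingQueue` class, and the example repo names are all hypothetical, and the real dataset format is whatever the tutorial in [1] describes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    """One fine-tuning request (hypothetical fields, not the real schema)."""
    dataset_repo: str       # HuggingFace dataset repo id (made-up example values below)
    base_model: str         # the LLM chosen by the user, e.g. "Mistral 7B"
    status: str = "queued"  # "queued" until the cluster picks it up


class TrainingQueue:
    """FIFO queue that hands out the oldest submission when GPUs are free."""

    def __init__(self) -> None:
        self._pending = deque()

    def submit(self, submission: Submission) -> int:
        """Add a submission and return its 1-based position in the queue."""
        self._pending.append(submission)
        return len(self._pending)

    def next_job(self) -> Optional[Submission]:
        """Pop the oldest submission, or return None if nothing is waiting."""
        if not self._pending:
            return None
        job = self._pending.popleft()
        job.status = "training"
        return job


queue = TrainingQueue()
queue.submit(Submission("alice/chat-data", "Mistral 7B"))
queue.submit(Submission("bob/code-data", "LLaMa 70B"))

job = queue.next_job()  # assumed to be called whenever the cluster has idle GPUs
print(job.base_model, job.status)  # prints "Mistral 7B training"
```

A plain FIFO is the simplest fair ordering; a real scheduler would presumably also account for model size and how many GPUs are free at a given moment.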

So why are we doing this?

We already have experience training big LLMs.

We've been able to achieve near-perfect infrastructure performance for training.

Sometimes our GPUs just have nothing to train.

So we thought it would be cool to keep our GPU cluster 100% utilized and give back to the open source community (we have already built an e2e distributed training framework [2]).

This is in an early stage, so you can expect some bugs.

Any thoughts, opinions, or ideas are very welcome!

[1]: https://github.com/higgsfield-ai/higgsfield/blob/main/tutori...

[2]: https://github.com/higgsfield-ai/higgsfield