by Trapais 883 days ago
>Making useful models is the goal.

Sure, the training dataset for Pythia is useful; the Pile was used in lots of models. However, it's hardly relevant that Pythia itself was trained on the Pile. They live separate lives.

Having just the weights already lets you produce incredibly useful results (you don't need the original dataset to add flash attention, or to tune a foundation model into a chat model).

Point is: having both doesn't make the released model more useful.

>Do you actually have any experience doing this? Have you ever fine tuned models or tried to change architecture or put a piece of one model into another?

Yes on both fine-tuning and "changing" architecture: with adapters and similar approaches you don't need to retrain everything from scratch after modifying the guts of the original architecture to your liking; you just need to not stir it up too much. Training on the task at hand is sufficient.
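
To make the adapter point concrete, here is a rough sketch of the idea in plain Python. It assumes a LoRA-style low-rank adapter, which is only one of the "adapters and similar approaches" I mean. All names (`LowRankAdapter`, `run_demo`, the toy 3x3 "model") are made up for illustration. The pretrained weight `W` stays frozen, the adapter starts as a no-op so nothing gets stirred up, and only the small adapter matrices are trained, using task data alone.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update B @ A (LoRA-style sketch)."""

    def __init__(self, W, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # "pretrained" weight: never updated below
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially matches the original.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)  # project down to `rank` dimensions
        return [b + d for b, d in zip(matvec(self.W, x), matvec(self.B, h))]

    def sgd_step(self, x, target, lr):
        """One SGD step on 0.5 * ||y - target||^2, updating only A and B."""
        h = matvec(self.A, x)
        err = [y - t for y, t in zip(self.forward(x), target)]
        # Backpropagate the error through B before B is modified.
        back = [sum(self.B[i][k] * err[i] for i in range(len(err)))
                for k in range(len(h))]
        for i, e in enumerate(err):
            for k, hk in enumerate(h):
                self.B[i][k] -= lr * e * hk
        for k, bk in enumerate(back):
            for j, xj in enumerate(x):
                self.A[k][j] -= lr * bk * xj

def mean_loss(model, data):
    total = 0.0
    for x, t in data:
        total += 0.5 * sum((y - ti) ** 2 for y, ti in zip(model.forward(x), t))
    return total / len(data)

def run_demo(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    W_before = [row[:] for row in W]
    # Hypothetical downstream task: the ideal weight is W plus a rank-1 change.
    u, v = [1.0, 0.0, -1.0], [0.5, 0.5, 0.0]
    W_task = [[W[i][j] + u[i] * v[j] for j in range(3)] for i in range(3)]
    data = []
    for _ in range(64):
        x = [rng.uniform(-1, 1) for _ in range(3)]
        data.append((x, matvec(W_task, x)))  # task data only, no pretraining data
    model = LowRankAdapter(W, rank=1, seed=seed + 1)
    initial = mean_loss(model, data)
    for step in range(steps):
        x, t = data[step % len(data)]
        model.sgd_step(x, t, lr)
    final = mean_loss(model, data)
    return initial, final, model.W == W_before

if __name__ == "__main__":
    initial, final, frozen = run_demo()
    print("loss before:", initial, "loss after:", final, "W untouched:", frozen)
```

In a real setup you'd do this with a deep learning framework and wrap existing layers, but the shape of it is the same: the original weights and the original dataset never enter the training loop.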

No, I haven't glued parts of existing models together (ensembles don't count).