by marci 765 days ago
Maybe those aren't the right metrics to compare.

True, the model is bigger, but required less tokens than Llama 3 to train. The issue is that when there are no open datasets, it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?


> True, the model is bigger, but required less tokens than Llama 3 to train.

That…doesn’t matter to users. Users care what it can do, and what it requires for them to use it, not what it took for you to make it.

Sure, if it has better performance relative to training set size, that’s interesting from a scientific perspective and for learning how to train models, maybe, if it scales the same way as other models in that regard. But ultimately, for use, until you get a model that does better in absolute terms, or does better relative to models with the same resource demands, you aren’t offering an advantage.

I understand, I'm just glad about the possible implications for future models: less expensive to make => less expensive to iterate. MoEs are cheaper to train. My favorite right now is Wizard 8x22b, so as a random user, I don't really care about this model. I will probably never run it as-is. But it makes me hope for a Falcon-MoE.
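To make the "MoE are cheaper to train" point concrete, here is a rough sketch of the usual argument: only the routed experts run for each token, so the parameters doing work per token are a fraction of the total. Everything here is an assumption for illustration: the function names are made up, the parameter counts are placeholders and not figures for any real model, and the 6 * N * D training-compute estimate is a common rule of thumb, not an exact cost.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE.

    Assumes every token is routed to exactly `top_k` of `n_experts`
    equally sized experts, plus a block of always-on shared parameters.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


def approx_training_flops(active_params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens


# Hypothetical numbers, chosen only to show the shape of the comparison.
total, active = moe_param_counts(
    shared_params=10_000_000_000,
    params_per_expert=5_000_000_000,
    n_experts=8,
    top_k=2,
)
tokens = 1_000_000_000_000

moe_flops = approx_training_flops(active, tokens)
dense_flops = approx_training_flops(total, tokens)  # dense model of the same total size

print(f"total params:  {total:,}")
print(f"active params: {active:,}")
print(f"MoE / dense training FLOPs: {moe_flops / dense_flops:.2f}")
```

Under these assumptions the saving comes entirely from the active/total ratio; memory to hold the model still scales with the total count, which is the "what it requires to use it" side of the argument above.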

Also, the fact that it's less dense than Llama 3 means there may be more room for LoRA fine-tuning, at a lower cost than Llama 3 would require and while sacrificing far less of its smarts. That may be my use.
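For a sense of why LoRA fine-tuning is cheap at all, here is a minimal sketch of the parameter arithmetic. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors (d_out x r and r x d_in) instead. The layer size and rank below are illustrative assumptions, not values from any particular model, and this says nothing about whether a less dense model actually loses less quality when tuned.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when updating the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer with a small adapter rank.
d_in = d_out = 4096
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)

print(f"full fine-tune: {full:,} trainable params")
print(f"LoRA (r={rank}): {lora:,} trainable params")
print(f"fraction trained: {lora / full:.4%}")
```

The adapter cost grows linearly with the layer dimensions rather than with their product, which is where the savings come from.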