| HN Mirror


by sillysaurusx 856 days ago

I’m extremely skeptical of this approach. Until proven otherwise, with a model that users actually find useful, I don’t think this can work.

It would be nice. But I’ve seen too many nice ideas completely fall apart in practice to accept this without some justification. Even if there are papers on the topic, and those papers show that the models rank highly according to some eval metrics, the only metric that truly matters is "the user likes the model and it solves their problems."

By the way, on a separate topic, the 90/10 dataset split that you do in all of your examples turns out to be fraught with peril in practice. The issue is that the validation dataset quality turns out to be crucial, and randomly yeeting 10% of your data into the validation dataset without manual review is a recipe for problems.
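To make the split concern concrete, here is a minimal sketch, not the library's code and not a complete answer. It assumes the dataset is a plain list of strings, and `normalize` and `split_with_review_queue` are hypothetical helper names made up for this example. It still does a random split, but it routes the obviously suspect validation items (empty text, duplicates of training items, repeats within the validation set) into a queue for a human to look at instead of trusting them blindly.

```python
import random


def normalize(text):
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def split_with_review_queue(examples, val_fraction=0.1, seed=0):
    """Random split, plus a queue of validation items a human should inspect.

    Hypothetical helper: it only catches empty strings and duplicates, which
    is a small subset of what a manual review would look for.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]

    train_keys = {normalize(ex) for ex in train}
    seen_val_keys = set()
    clean_val, needs_review = [], []
    for ex in val:
        key = normalize(ex)
        # Flag empty text, overlap with train (leakage), or a repeat within val.
        if not key or key in train_keys or key in seen_val_keys:
            needs_review.append(ex)
        else:
            clean_val.append(ex)
            seen_val_keys.add(key)
    return train, clean_val, needs_review


# Toy data with deliberate duplicates and empty strings.
examples = ["What is 2+2?", "what is  2+2?", "Name a prime.", "", "Define entropy."] * 4
train, clean_val, needs_review = split_with_review_queue(examples, val_fraction=0.25)
```

Anything that lands in `needs_review` is only a candidate for inspection. An automated filter like this does not replace reading the validation set by hand.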

1 comment

patelajay285 856 days ago

It's a demo snippet of how to set up the workflow. It's not meant to be a working production example of a self-rewarding model or a faithful reproduction of the original paper. Whether or not self-rewarding LLMs are a good idea, they're a valuable and very active area of research in the literature today. This is a library for ML researchers, who should actively research and study these avenues along with the pitfalls you're mentioning. But in order for them to do that, building these workflows has to be accessible to them, which is what this library is meant to do. It's not meant for the "hobbyist" ML community; they should not be using synthetic data in this way today, as it would likely lead to subpar results for any practical model or task.


sillysaurusx 856 days ago

There’s a lot to unpack here.

First, I’m an ML researcher. I don’t go around saying so because appeal to authority is bogus, but since every one of your comments seems to do this, it’s unavoidable.

You say the code is for ML researchers, then flat out say that it’s not a working production example, nor is it a faithful reproduction of a paper. So what is it?

Whether you want it to be or not, your audience is the hobbyist ML community, because without benchmarks proving that your code examples work, no one from the research community will trust them. That’s the hard part of research, and it’s most of the effort.

My advice is, write something that can train useful models. Implement a production grade workflow, and show some reasons why it works. If you’re trying to get the wider ML research community to buy in to this, there’s not much other way to do it. No one will want to take easy code that does the wrong thing, and most of your examples show the wrong thing to do, like the 90/10 split.

You’re also a bit defensive about accepting feedback. Trust me that it’s better to accept that your code sucks and does the wrong thing, and then try to make it suck less and do the right thing. That’s how the majority of good software is written, unless you’re cperciva. But he’d also publish a paper explaining why his code is correct.

Anyway, the whole point of posting this to HN is to get feedback on it. (If you were hoping that a bunch of people would suddenly use it, then you need to appeal to the hobbyist community. They’ve told you a bunch of things that you’ve straight up said are out of scope.) And it sounds like you were hoping for feedback from ML researchers. Maybe others will chime in, but for now, that’s the best I’ve got.


patelajay285 856 days ago

I think you're interpreting hostility where there is none, so I don't have much to say other than that it's an infrastructure library; a demonstration snippet doesn't need to show how to train a production grade model. I appreciate the feedback and it's noted.


sillysaurusx 856 days ago

Well, this is a decent example. I didn’t say you were hostile, just defensive.

Speaking as an ML researcher: infrastructure libraries need to show how to train a production grade model, or else they’re useless for research. This is why research is hard. You keep handwaving this in various ways, but if you want ML researchers to take this seriously, you need a serious example.

"Production grade" doesn’t mean that it needs to have a deployable API. It means the model needs to not suck. And until your training code can train a model that doesn’t suck, every ML researcher will view this and think "this code is guaranteed to produce a model that sucks," since there’s no evidence to the contrary. It’s incredibly hard to get the details right, and I can’t count the number of times I’ve had to track down some obscure bug buried deep within abstraction layers.

I’m trying to help you here. Ask yourself: who are my users? Are your users ML researchers? I already explained the problems we have, and why your library doesn’t meet those needs. Are your users ML hobbyists? You’ve already said no to this, and I think that’s a mistake. Most ML researchers behave as hobbyists, in the sense that they’re always looking for simple, understandable examples. Your library gives that, but without any of the rigor necessary to show that it can be trusted. Are your users ML devops, since it’s infrastructure? No, because it’s training models.

So you’re excluding every possible user, whether you realize it or not. But we’ll see; in a few months, if your library has significant traction, then I’m empirically wrong. But I’m trying to help you avoid the default outcome, where nobody uses your code because you’re not designing it for any particular user.


patelajay285 856 days ago

Thanks for clarifying. For the record, I generally agree with you. I think we just disagree on the snippets and how in-depth they need to be. Our library is built on HF libraries (we don't implement the training code ourselves), which are popular and commonly used by researchers, and people know how to build good models on those libraries. The package is simply meant to provide an easier interface for creating some of the complex multi-stage LLM workflows that are starting to become common at ML research conferences, and to reduce boilerplate code around common functions (caching or tokenizing).

But I hear you that it would be useful to also have some examples that show a proper, reliable model being trained with the library vs. just example models. The project is pretty early, and we'll work on adding more examples.
