

by resiros 442 days ago

Congrats on the launch! The idea sounds very interesting on paper. The tricky part though is the reward function.

Providing finetuning as a service works because the friction with finetuning is operational (getting the GPUs, preparing the training...), so the vendor can take care of that and give you an API. The work becomes straightforward and doesn't require much preparation - give us some examples and we'll provide you a model that works well with these and hopefully generalizes.

RL as a service is much trickier in my opinion. The friction is not only operational. Getting RL to work (at least from my probably outdated, 10-year-old knowledge) is much harder because the real friction is in building the right reward function. I've skimmed your docs, and you don't say much about reward functions beyond the obvious.

I think to get this to work, you need to improve your docs and examples a lot, and maybe focus on some recurring use cases (e.g., customer support agent) with clear reward functions. Perhaps provide some building-block reward functions and some UI/tools to help create them. Basically, find a way to remove the real friction in using RL for my agent: the reward function part.
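To make the building-block idea concrete, here's a rough sketch of what composable reward functions could look like. Everything in it is hypothetical: the `RewardFn` signature, the helper names, the weights, and the scoring rules are assumptions for illustration, not anything taken from the vendor's docs or API.

```python
import json
from typing import Callable, Sequence, Tuple

# Assumed signature: a reward maps (prompt, response) to a score in [0, 1].
RewardFn = Callable[[str, str], float]


def valid_json(prompt: str, response: str) -> float:
    """Formatting reward: 1.0 if the response parses as JSON, else 0.0."""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def max_words(limit: int) -> RewardFn:
    """Full score up to `limit` words, then linear decay to 0 at 2 * limit."""
    def reward(prompt: str, response: str) -> float:
        n = len(response.split())
        if n <= limit:
            return 1.0
        return max(0.0, 1.0 - (n - limit) / limit)
    return reward


def mentions_any(phrases: Sequence[str]) -> RewardFn:
    """1.0 if the response contains any phrase (case-insensitive), else 0.0."""
    def reward(prompt: str, response: str) -> float:
        text = response.lower()
        return 1.0 if any(p.lower() in text for p in phrases) else 0.0
    return reward


def weighted(parts: Sequence[Tuple[float, RewardFn]]) -> RewardFn:
    """Combine rewards into a weighted average that stays in [0, 1]."""
    total = sum(w for w, _ in parts)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def reward(prompt: str, response: str) -> float:
        return sum(w * fn(prompt, response) for w, fn in parts) / total
    return reward


# Illustrative customer-support reward: offer a remedy, keep it short.
support_reward = weighted([
    (2.0, mentions_any(["refund", "replacement"])),
    (1.0, max_words(50)),
])
```

A user would then only tweak the phrases, limits, and weights rather than write a reward from scratch. Keyword matching is obviously a crude proxy; the point is the composition pattern, not these particular checks.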

In any case, congrats again on the launch. We're building an LLMOps platform (see my profile), so there might be collaboration/integration potential. Write me if you think that's interesting.

1 comment

lukasego 442 days ago

Thanks for this very lucid post! For many use cases, such as coding or formatting, it's very clear to users how to define the reward function. For more intricate ones, you're right that it can be tricky. I like your ideas of providing tools to help here and offering recurring reward functions as templates that only need slight adaptation. It will still be the user defining the reward, but there's a path to simplification.

The operational friction of getting the GPUs, optimizing compute, and preparing the training is hard for RL, hence we got these things out of the way.

Thanks for the very thoughtful suggestions and for reaching out, great input!
