by namibj 846 days ago
Hmmm, I could think of using it to update a DDPM with a conditioning input as the dataset expands from an RL/online process, without ruining the conditioning mechanism that's only trainable through the actual RL itself.

I.e., self-supervised training is done to produce semantically sensible results, and the RL-trained conditioning input steers the model toward contextually useful ones.
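To make that split concrete, here is a toy sketch of one way to read it: the parameters are partitioned into a "base" group updated by the self-supervised denoising loss and a "cond" group that only RL is allowed to touch. Everything here is an assumption for illustration (the linear one-dimensional "denoiser", the names `make_params`, `predict_noise`, `self_supervised_step`, the fixed `alpha_bar`); it is not a real DDPM and not the setup actually in use.

```python
import random

def make_params():
    # Hypothetical split: "base" is trained self-supervised,
    # "cond" is reserved for the RL-trained conditioning pathway.
    return {"base": [0.0, 0.0], "cond": [0.5]}

def predict_noise(params, x_t, c):
    # Toy linear noise predictor: eps_hat = w * x_t + b + k * c
    w, b = params["base"]
    (k,) = params["cond"]
    return w * x_t + b + k * c

def self_supervised_step(params, x0, lr=0.05, alpha_bar=0.5, rng=random):
    # Standard DDPM-style forward noising of a clean sample x0.
    eps = rng.gauss(0.0, 1.0)
    x_t = alpha_bar ** 0.5 * x0 + (1.0 - alpha_bar) ** 0.5 * eps
    c = 0.0  # null conditioning: these samples carry no RL signal
    err = predict_noise(params, x_t, c) - eps
    # Gradient step on 0.5 * err**2 for the base group only;
    # params["cond"] is deliberately left untouched.
    w, b = params["base"]
    params["base"] = [w - lr * err * x_t, b - lr * err]
    return 0.5 * err * err
```

Freezing the conditioning weights like this only guarantees they are not overwritten. It does not guarantee they still steer the updated base the same way, which is the part I'm asking about below.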

(Btw, if anyone has tips on how not to wreck the RL training's effort when updating the base model with recently encountered, semantically valid samples that can be used for self-supervised training, please tell. I'd hate to throw away the RL effort expended to acquire that much training data for good self-supervised operation. It's already looking fairly expensive...)