Hacker News
by minimaxir 1518 days ago
The dataset they used is public; the models and the training process they used are not.

https://areyoutheasshole.com/about https://areyoutheasshole.com/training-data


> pushshift.io, a website and database which logs of all of the posts that go on Reddit when they get posted

Such a great resource. It's surprisingly easy to build your own massive datasets using it. I re-derived WebText2, used for training GPT-3, just on a home machine. And with some image scraping you can build up image datasets for training interesting GAN models.
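
For a rough idea of what that kind of re-derivation involves, here is a minimal sketch. It assumes you already have a Pushshift-style dump of Reddit submissions as newline-delimited JSON, and that each record carries the usual `url`, `score`, and `is_self` fields. The function name and the threshold of 3 (the karma cutoff described for the original WebText) are illustrative choices, not the exact pipeline I ran.

```python
import json
from urllib.parse import urlparse

MIN_SCORE = 3  # assumption: WebText-style karma cutoff


def outbound_urls(lines, min_score=MIN_SCORE):
    """Yield unique external URLs from newline-delimited JSON submissions."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        post = json.loads(line)
        url = post.get("url", "")
        host = urlparse(url).netloc.lower()
        # Drop self posts, malformed URLs, and links back into Reddit itself.
        if post.get("is_self") or not host:
            continue
        if host == "reddit.com" or host.endswith(".reddit.com"):
            continue
        if post.get("score", 0) < min_score:
            continue
        if url not in seen:
            seen.add(url)
            yield url
```

You'd feed it an open file handle (or any iterable of lines) and pass the resulting URLs to whatever scraper and text extractor you prefer; that part is where most of the actual work is.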

> the training process they used are not.

Seems like it'd be fairly straightforward to finetune an existing language model: GPT-3 if you've got spare change; GPT-J-6B can be finetuned in Colab for free; and GPT-NeoX-20B could be finetuned for free/cheap. Use simple concats of an AITA post followed by a top comment. Balance for NTA/YTA like the Training Data page mentions, and I'll bet you'll get comparable results.
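
To make the data prep concrete, here is a minimal sketch of the concat-and-balance step. It assumes you already have (post, top comment) pairs and that the verdict is the first three characters of the comment. The helper names are hypothetical, and downsampling the larger class is just one way to balance, not necessarily what the creators did.

```python
import random


def verdict(comment):
    """Return 'NTA' or 'YTA' if the comment opens with one, else None."""
    head = comment.strip().upper()[:3]
    return head if head in ("NTA", "YTA") else None


def build_examples(pairs, seed=0):
    """Turn (post_text, top_comment) pairs into a balanced list of training strings."""
    buckets = {"NTA": [], "YTA": []}
    for post, comment in pairs:
        label = verdict(comment)
        if label:  # other verdicts (ESH, NAH, ...) are dropped in this sketch
            buckets[label].append(f"{post.strip()}\n\n{comment.strip()}")
    # Downsample the larger class so both verdicts appear equally often.
    n = min(len(buckets["NTA"]), len(buckets["YTA"]))
    rng = random.Random(seed)
    balanced = rng.sample(buckets["NTA"], n) + rng.sample(buckets["YTA"], n)
    rng.shuffle(balanced)
    return balanced
```

Each returned string is one training document: the post, a blank line, then the comment. Write them out one per record in whatever format your finetuning script expects.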

That said, the _idea_ of this bot is really cool and fun.

Straightforward to tune, but given the dataset size it would require a substantial amount of compute, more than a Colab session can provide without timing out.

The comments by the creators imply they used some sort of SaaS for both training and deployment.