If it's really trained exclusively off of HN comments, I expect most of the bot's responses will evade the actual question but spend several paragraphs debating the factual specifics of every possible related tangential point, followed by a thinly veiled insult questioning the user's true motivations.
In no way does a typical HN comment debate every possible related tangential point. Do we expect a modicum of intellectual rigor? Yes. But to say every tangent is followed and scrutinized is simply factually untrue.
And several paragraphs? I challenge you to show even a large minority of argumentative responses that veer into "several" paragraphs. You characterize this as "most of the ... responses" but I think that's unfair.
One wonders why you'd resort to such hyperbole unless you were deliberately attempting to undermine the value of the site.
Is it exclusively HN comments and nothing else? How does a model like that know how to speak English (noun/verb and all that) if you are starting from scratch and feeding it nothing but HN comments?
I'm sorry to be THAT GUY, but it is addressed in the article :)
>GPT embeddings
To index these stories, I loaded up to 2000 tokens worth of comment text (ordered by score, max 2000 characters per comment) and the title of the article for each story and sent them to OpenAI's embedding endpoint, using the standard text-embedding-ada-002 model. This endpoint accepts bulk uploads and is fast, but all 160k+ documents still took over two hours to create embeddings. Total cost for this part was around $70.
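For anyone who wants to see that indexing step concretely, here is a rough sketch of how the per-story text might be assembled before it is sent for embedding. This is not the article's code: the function name `build_document`, the `(score, text)` tuple shape, and the whitespace-based token count are all assumptions made for illustration (the real pipeline would presumably count tokens with the model's tokenizer), and I'm guessing that the title does not count against the token budget.

```python
def build_document(title, comments, max_chars=2000, max_tokens=2000,
                   count_tokens=lambda s: len(s.split())):
    """Assemble the text for one story: title plus top-scored comments.

    comments is a list of (score, text) tuples (a hypothetical shape).
    count_tokens is a crude whitespace stand-in for a real tokenizer.
    """
    parts = [title]
    used = 0  # assumption: only comment text counts toward the budget
    # Highest-scored comments first, as the quoted passage describes.
    for _score, text in sorted(comments, key=lambda c: c[0], reverse=True):
        clipped = text[:max_chars]      # cap each comment at max_chars
        cost = count_tokens(clipped)
        if used + cost > max_tokens:    # stop once the budget is spent
            break
        parts.append(clipped)
        used += cost
    return "\n\n".join(parts)
```

Each resulting string would then be one "document" to embed, so a long thread contributes only its best-scored comments.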
In a nutshell, this is using OpenAI's API to generate embeddings for top comments on HN, then also generating an embedding for the search term. It can then find the comments most closely related to the given question by comparing the embeddings, and send the actual text to GPT-3 to summarize. It's a pretty clever way to do it.
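The "comparing the embeddings" part is usually done with cosine similarity, though I'm assuming that here since the thread doesn't say which metric is used. A minimal sketch, with tiny made-up 3-dimensional vectors standing in for real embeddings (real ada-002 vectors have 1536 dimensions and would come from the API, which this snippet does not call); `cosine` and `top_k` are hypothetical helper names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy data: made-up vectors, not real embeddings.
docs = {
    "rust": [1.0, 0.0, 0.1],
    "python": [0.0, 1.0, 0.1],
    "startups": [0.1, 0.1, 1.0],
}
query = [0.9, 0.3, 0.0]
nearest = top_k(query, docs, k=2)  # ['rust', 'python']
```

The text behind the returned ids is what would get pasted into the summarization prompt. With 160k+ documents, a brute-force scan like this is still workable, although a vectorized or indexed search would be the more likely choice in practice.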
I have to assume that targeted/curated LLM training sets will tend to be less accurate than very general ones, just by the very nature of how they work.
I know it's not quite analogous, but I fine-tuned GPT-3 on a small (200 examples) data set and it performed extremely poorly compared to the base version without fine-tuning.
This surprised me. I thought it wouldn't do much better, but I wasn't expecting that specializing it on my target data would reduce performance! I had fewer examples than the minimum OpenAI recommends, so maybe it was a case of overfitting or something like that.