HN Mirror


	by retube 1254 days ago
	I am guessing they already were? But this is 100% pure, concentrated HN not contaminated with nonsense from the rest of the web :)

3 comments

bityard 1254 days ago

If it's really trained exclusively on HN comments, I expect most of the bot's responses will evade the actual question but spend several paragraphs debating the factual specifics of every possible tangential point, followed by a thinly veiled insult questioning the user's true motivations.

heleninboodler 1254 days ago

In no way does a typical HN comment debate every possible related tangential point. Do we expect a modicum of intellectual rigor? Yes. But to say every tangent is followed and scrutinized is simply factually untrue.

And several paragraphs? I challenge you to show even a large minority of argumentative responses that veer into "several" paragraphs. You characterize this as "most of the ... responses" but I think that's unfair.

One wonders why you'd resort to such hyperbole unless you were deliberately attempting to undermine the value of the site.

GreenWatermelon 1254 days ago

This is my favorite type of humour.

Aromasin 1254 days ago

If you're not arguing over the semantics, rather than OP's clear-enough intent, are you really on HN?

jb1991 1254 days ago

That had me laughing! Case in point, from a few days ago: https://news.ycombinator.com/item?id=34855372

rocho 1253 days ago

It's not trained at all. The bot finds relevant comments and then uses OpenAI's API to summarize them.

MuffinFlavored 1254 days ago

Is it exclusively HN comments and nothing else? How does a model like that know how to speak English (noun/verb and all that) if you are starting from scratch and feeding it nothing but HN comments?

neoromantique 1254 days ago

I'm sorry to be THAT GUY, but it is addressed in the article :)

>GPT embeddings

To index these stories, I loaded up to 2000 tokens worth of comment text (ordered by score, max 2000 characters per comment) and the title of the article for each story and sent them to OpenAI's embedding endpoint, using the standard text-embedding-ada-002 model. This endpoint accepts bulk uploads and is fast, but all 160k+ documents still took over two hours to embed. Total cost for this part was around $70.
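For anyone who wants to picture the document-building step in that quote, here is a rough sketch. The two limits come from the quoted text; everything else is an assumption for illustration: the `Comment` class, the `build_document` name, and above all the word-count stand-in for a tokenizer (the article presumably counted real model tokens). The call to the embedding endpoint is left out on purpose.

```python
from dataclasses import dataclass
from typing import List

MAX_CHARS_PER_COMMENT = 2000  # per-comment cap, from the quoted article
MAX_TOKENS_PER_STORY = 2000   # comment-text budget per story, from the quoted article


@dataclass
class Comment:
    # Hypothetical shape of a comment record; not taken from the article.
    text: str
    score: int


def approx_token_count(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def build_document(title: str, comments: List[Comment]) -> str:
    """Join the title with the highest-scored comments that fit the budget."""
    parts = [title]
    budget = MAX_TOKENS_PER_STORY
    for comment in sorted(comments, key=lambda c: c.score, reverse=True):
        snippet = comment.text[:MAX_CHARS_PER_COMMENT]
        cost = approx_token_count(snippet)
        if cost > budget:
            break  # stop at the first comment that no longer fits
        parts.append(snippet)
        budget -= cost
    return "\n\n".join(parts)
```

Each string returned by `build_document` would be one document to embed. Swapping `approx_token_count` for a real tokenizer would make the budget match what the endpoint actually sees.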

gorbypark 1254 days ago

In a nutshell, this uses OpenAI's API to generate embeddings for top comments on HN, and also generates an embedding for the search term. It can then find the most closely related comments for a given question by comparing the embeddings, and send the actual text to GPT-3 to summarize. It's a pretty clever way to do it.
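The "comparing the embeddings" part can be sketched in a few lines. This is only an illustration under assumptions: cosine similarity is a common choice but the comparison metric isn't stated here, the tiny 3-dimensional vectors are made up (real embeddings have far more dimensions), and `top_k` is a hypothetical helper name. A real index of 160k+ documents would more likely use a vector library than this brute-force loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the k texts whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (text, made-up embedding) pairs.
toy_index = [
    ("comment about Rust", [1.0, 0.0, 0.0]),
    ("comment about Python", [0.0, 1.0, 0.0]),
    ("comment about Go", [0.9, 0.1, 0.0]),
]
closest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

The texts in `closest` are what would then be pasted into the summarization prompt along with the user's question.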

nkozyra 1254 days ago

> How does a model like that know how to speak English

Mimicry.

nkozyra 1254 days ago

I have to assume that targeted/curated LLM training sets will tend to produce less accurate models than very general ones, just by the very nature of how they work.

(edited for clarity)

andai 1254 days ago

I know it's not quite analogous, but I fine-tuned GPT-3 on a small (200 examples) data set and it performed extremely poorly compared to the base version that hadn't been fine-tuned.

This surprised me. I didn't think it would do much better, but I wasn't expecting that specializing it on my target data would reduce performance! I had fewer examples than the minimum OpenAI recommends, so maybe it was a case of overfitting or something like that.
