| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by russdill 109 days ago
	I've experimented with having different sized LLMs cooperating. The smaller LLM starts a response while the larger LLM is starting. It's fed the initial response so it can continue it. The idea of having an LLM follow and continuously predict the speaker. It would allow a response to be continually generated. If the prediction is correct, the response can be started with zero latency.

1 comments

Barbing 109 days ago

Google seems to be experimenting with this with their AI Mode. They used to be more likely to send 10 blue links in response to complex queries, but now they may instead start you off with slop.

(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)

link

russdill 109 days ago

This is more of a "Are all the windows closed upstairs?"

"The windows upstairs..."

"...are all closed except for the bedroom window"

The first portion of the response requires a couple of seconds to play but only a few tens of milliseconds to start streaming using a small model. Currently I just break the small model's response off at whatever point will produce about enough time to spin up the larger model.

But all responses spin up both models.

link

Barbing 109 days ago

Whoa, that thing's fast. Very nice! Bet that's fun to play with, least probably fun the first time you saw it working :)

link