| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by m3kw9 1128 days ago
	I’m not understanding how Guidence Accelerating works. It says “ This cuts this prompt's runtime in half vs. a standard generation approach.” and it gives an example of it asking LLM to generate json. I don’t see anywhere how it accelerates anything because it’s a simple json completion call. How can you accelerate that?

2 comments

evanmays 1128 days ago

The interface makes it look simple, but under the hood it follows a similar approach to jsonformer/clownfish [1] passing control of generation back and forth between a slow LLM and relatively fast python

Let's say you're halfway through a generation of a json blob with a name field and a job field and have already generated

  {
    "name": "bob"

At this point, guidance will take over generation control from the model to generate the next text

  {
    "name": "bob",
    "job":

If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420ms. But since guidance took over, you save all that time.

Then guidance passes control back to the model for generating the next field value.

  {
    "name": "bob",
    "job": "programmer"

programmer is 2 tokens and the closing " is 1 token, so this took 210ms to generate. Guidance then takes over again to finish the blob

  {
    "name": "bob",
    "job": "programmer"
  }

[1] https://github.com/1rgs/jsonformer https://github.com/newhouseb/clownfish Note: guidance is way more general of a tool than these

Edit: spacing

link

m3kw9 1128 days ago

Thanks for the cool response. Would this use a lot more input token if I’m understanding this correctly because you are stopping the generation after a single fill and then generating again and inputing that for another token?

link

alew1 1128 days ago

But the model ultimately still has to process the comma, the newline, the "job". Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?

link

sebzim4500 1128 days ago

Yes. If you look at the biggest models on OpenAI and Anthropic apis, the prompt tokens are significantly cheaper than the response tokens.

link

june_twenty 1128 days ago

Thanks for that example. Very helpful

link

jackdeansmith 1128 days ago

By not generating the fixed json structure (brackets, commas, etc...) and skipping the model ahead to the next tokens you actually want to generate, I think

link