|
|
|
|
|
by slundberg
1128 days ago
|
|
If you want guidance acceleration speedups (and token healing) then you have to use an open model locally right now, though we are working on setting up a remote server solution as well. I expect APIs will adopt some support for more control over time, but right now commercial endpoints like OpenAI are supported through multiple calls. We manage the KV-cache in session based way that allows the LLM to just take one forward pass through the whole program (only generating the tokens it needs to) |
|