by CharlieDigital 607 days ago
If it's the problem I think it is, the solution is to run two concurrent prompts.

First prompt validates the input. Second prompt starts the actual content generation.

Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK". In the SSE, encode the chunks of each stream with a stream ID. You can also handle it on the server side by cancelling execution once the first stream ends.

Generally, the experience is good because the validation prompt is shorter and reaches its last (and only) token sooner than the content prompt produces anything worth rendering.

The SSE stream ends up like this:

    data: ing|tomatoes
    
    data: ing|basil
    
    data: ste|3. Chop the

I have a writeup (and repo) of the general technique of multi-streaming: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond... (animated gif at the bottom).
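
To make the gating concrete, here is a rough sketch of the client-side half, written in Python for brevity even though a real front end would be JavaScript. The stream IDs `val` and `gen`, and the assumption that the validation stream emits a single `OK` token, are mine for illustration and not taken from the writeup.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_sse_data(line: str) -> Optional[Tuple[str, str]]:
    """Split one 'data: <stream_id>|<chunk>' line; return None for anything else."""
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):]
    if body.startswith(" "):
        body = body[1:]  # SSE drops one leading space after the colon
    stream_id, sep, chunk = body.partition("|")
    if not sep:
        return None
    return stream_id, chunk


def gated_render(lines: Iterable[str]) -> Iterator[str]:
    """Yield content chunks only once the validation stream has said OK."""
    held: List[str] = []  # content that arrived before the verdict
    approved = False
    for line in lines:
        parsed = parse_sse_data(line)
        if parsed is None:
            continue  # blank separator lines, comments, etc.
        stream_id, chunk = parsed
        if stream_id == "val":  # hypothetical validation stream ID
            if chunk.strip() != "OK":
                return  # rejected: drop held content, render nothing more
            approved = True
            yield from held
            held.clear()
        elif stream_id == "gen":  # hypothetical content stream ID
            if approved:
                yield chunk
            else:
                held.append(chunk)
```

Content that arrives before the verdict is held rather than discarded, so nothing is lost if validation passes a little late.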
3 comments

This doesn't solve the critical problem, which is that you usually can't tell if something is okay until you have context that you don't yet have. This is why even SOTA models will backtrack when you hit the filter—they only realize you're treading into banned territory after a bunch of text has already been generated, including text that already breaks the rules.

This is hard to fix because if you don't wait until you have enough context, you've given your censor a hair trigger.

> Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK".

Just a note that this particular implementation has the additional problem of not actually applying your validation stream at the API level, which means your service can and will be abused worse than it would be if you combined the streams server-side. You should never rely on client-side validation for security or legal compliance.
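
For what it's worth, combining the streams server-side could look roughly like the asyncio sketch below. The `fake_validate` and `fake_content` functions are stand-ins I made up for the two model calls; only `gated_stream` is the part that matters, and it is one possible shape, not a reference implementation.

```python
import asyncio
from typing import AsyncIterator, Awaitable, List

_DONE = object()  # sentinel marking the end of the content stream


async def _pump(source: AsyncIterator[str], queue: asyncio.Queue) -> None:
    """Copy generated chunks into a server-side queue."""
    try:
        async for chunk in source:
            await queue.put(chunk)
    finally:
        queue.put_nowait(_DONE)


async def gated_stream(
    validation: Awaitable[bool], content: AsyncIterator[str]
) -> AsyncIterator[str]:
    """Generate while validating, but send nothing to the client until OK."""
    queue: asyncio.Queue = asyncio.Queue()
    pump = asyncio.create_task(_pump(content, queue))
    try:
        if not await validation:
            return  # rejected: the client never sees a single token
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            yield item
    finally:
        pump.cancel()  # stops generation early; no-op if already finished


# Stand-ins for the two model calls (hypothetical, for demonstration only).
async def fake_validate(ok: bool) -> bool:
    await asyncio.sleep(0.01)
    return ok


async def fake_content() -> AsyncIterator[str]:
    for chunk in ["Chop ", "the ", "basil"]:
        await asyncio.sleep(0)
        yield chunk


async def collect(ok: bool) -> List[str]:
    return [c async for c in gated_stream(fake_validate(ok), fake_content())]
```

Because the buffering happens before anything is written to the response, a client that ignores the validation stream has nothing to read.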

That's why I qualified it "general technique" and explicitly mentioned the option of server abort.

For most consumer use cases, it probably doesn't matter if a few tokens leak before the abort, especially if they're not rendered.

Tune it to your needs :)

The OP is talking about constraining the response, not the input. Granted, in many cases the input may give some indication that the large language model is more prone to generating output that violates the given constraints, but this is by no means guaranteed.

As far as I know, there's no way of validating a streamed response until those tokens have already been streamed unfortunately. You could try buffering the stream in larger chunks before displaying them on screen in the hopes that you might be able to catch it earlier, but that's not going to be a great user experience either.
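
A minimal sketch of that buffering idea, assuming you have some `is_allowed` check over the text so far (a hypothetical callable; it could be a classifier, a regex, or another model call) and a `min_chars` threshold you would have to tune:

```python
from typing import Callable, Iterable, Iterator


def buffered_release(
    tokens: Iterable[str],
    is_allowed: Callable[[str], bool],
    min_chars: int = 40,
) -> Iterator[str]:
    """Release text in larger chunks, checking each one before it is shown."""
    shown = ""    # text already released to the user
    pending = ""  # text generated but not yet checked
    for token in tokens:
        pending += token
        if len(pending) < min_chars:
            continue  # keep buffering
        if not is_allowed(shown + pending):
            return  # stop before the offending chunk is displayed
        yield pending
        shown += pending
        pending = ""
    # Flush whatever is left at the end of the stream.
    if pending and is_allowed(shown + pending):
        yield pending
```

A larger `min_chars` gives the check more context but makes the output arrive in bigger, slower bursts, which is exactly the user-experience trade-off described above.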

One of the things I love about gen AI is that all its problems are solved by using more gen AI.