

by oersted 192 days ago

Impressive work, but I'm confused on a number of fronts:

- You are serving closed models like Claude with your CTGT policy applied, yet your method, as you described it, involves modifying internal model activations. Am I misunderstanding something here?

- Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?

- Could you share the publications associated with this research? You stated it comes from UCSD.

- What exactly are you serving in the API? Did you select a whitelist of features that you thought would be good to suppress? Which ones? Is it just the "hallucination" direction that you showcase in the benchmark? I see some vague personas, but no further control other than that. It's quite black-boxy the way you present it right now.

I don't mean this as a criticism; this looks great. I just want to understand what it is a bit better.


cgorlla 192 days ago

>yet, the way you described your method, it involves modifying internal model activations

It's a subtlety, but part of it works on API-based models. From the post:

"we combine this with a graph verification pipeline (which works on closed weight models)"

The graph-based policy adjudication doesn't need access to the model weights.

>Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?

You could, via RFT or similar on the outputs. As it stands, the mechanism functions as a layer on top of the model without affecting the underlying weights, so the benefit is that it does not create another artifact for each customization.

>What exactly are you serving in the API?

It's the base policy configuration that created the benchmark results, along with various personas to give users an idea of how uploading a custom policy would work.

For industry-specific deployments, we have additional base policies that we deploy for that vertical, so this is meant to simulate that aspect of the platform.


oersted 192 days ago

> graph-based policy adjudication

What do you mean by this? Does the method involve playing with output token probabilities? Or modifying the prompt? Or blocking bad outputs?

> how uploading a custom policy would work

Do you have more info on this? Is this something you offer already or something you are planning? How would policies be defined, as a prompt? As a dataset of examples?


cgorlla 192 days ago

We create a policy hierarchy with a graph structure, based on certain elements of the generative content coming into our system, as well as what we know about the application where it's deployed.

The main benefit is we can traverse this graph deterministically when evaluating content and determine which policies need to be applied (if any) in a more rigorous manner than just, say, stuffing 900 FINRA rules into a prompt.

On custom policies, yes, this is core functionality of our deployed product. The input typically looks like PDFs, doc files, or even Slack transcripts with relevant business info. The policy engine discretizes these into tone, forbidden words, key phrases, etc., which form the elements of the aforementioned graph.
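
For readers trying to picture the graph traversal described above, here is a minimal sketch in Python. It is not CTGT's implementation, and nothing in it comes from their codebase: `PolicyNode`, `applicable_policies`, `violations`, the context keys (`vertical`, `audience`), and the example policies are all hypothetical, and the only policy element modeled is forbidden words. The point it illustrates is that walking a policy hierarchy in a fixed order, pruning subtrees whose condition does not match, always selects the same policies for the same input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical context: facts about the deployment and the incoming content.
Context = Dict[str, str]


@dataclass
class PolicyNode:
    """One node in an assumed policy hierarchy (illustrative only)."""
    name: str
    applies: Callable[[Context], bool]          # condition for this node
    forbidden_words: List[str] = field(default_factory=list)
    children: List["PolicyNode"] = field(default_factory=list)


def applicable_policies(root: PolicyNode, ctx: Context) -> List[PolicyNode]:
    """Depth-first walk in declaration order; non-matching subtrees are pruned."""
    selected: List[PolicyNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.applies(ctx):
            continue  # skip this node and everything beneath it
        selected.append(node)
        # Reverse so the first declared child is visited first.
        stack.extend(reversed(node.children))
    return selected


def violations(text: str, policies: List[PolicyNode]) -> List[Tuple[str, str]]:
    """Return (policy name, forbidden phrase) pairs found in the text."""
    lowered = text.lower()
    return [
        (p.name, w)
        for p in policies
        for w in p.forbidden_words
        if w.lower() in lowered
    ]


# Made-up example hierarchy: a base policy with two vertical-specific branches.
root = PolicyNode("base", lambda ctx: True, children=[
    PolicyNode(
        "finance",
        lambda ctx: ctx.get("vertical") == "finance",
        forbidden_words=["guaranteed returns"],
        children=[
            PolicyNode(
                "retail_advice",
                lambda ctx: ctx.get("audience") == "retail",
                forbidden_words=["risk-free"],
            ),
        ],
    ),
    PolicyNode(
        "healthcare",
        lambda ctx: ctx.get("vertical") == "healthcare",
        forbidden_words=["cure"],
    ),
])

ctx = {"vertical": "finance", "audience": "retail"}
selected = applicable_policies(root, ctx)
found = violations("This fund offers guaranteed returns and is risk-free.", selected)
```

In this toy setup only the `base`, `finance`, and `retail_advice` nodes are selected for a retail finance deployment, so the `healthcare` rules are never evaluated. How the selected policies are then enforced on a closed-weight model is not specified in the thread, so the sketch stops at detection.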


KTibow 192 days ago

Okay, but what does "applied" look like? Including a prompt?
