Hacker News
Show HN: Prototype for internalized AI values using shame/pride mechanisms
3 points by renaissancebro 80 days ago
I built a proof of concept for modeling emotional primitives — shame, pride, identity — as an alternative approach to AI alignment. Current approaches are mostly external (constitutional AI, RLHF). I kept thinking about why humans don't do terrible things despite having the capacity, and the answer felt more like internalized values than external rules. Repo here: https://github.com/renaissancebro/AGI-seed

How are you finding the stress-testing?

I’ve been working on a similar attempt to bypass "statistical mean" values by modeling the user as a high-fidelity data point rather than a category. My main "whack-a-mole" issue is preventing the base LLM from "palming the card"—where standard RLHF overrides the specific logic I’ve dropped into the context window.

It looks like you’re building a map of the psyche via complex context modules. I’ve gone the opposite way: I'm attempting to escape "contextual" logic by homing in on the antecedent homeostatic mechanisms, where context is just an emergent derivative of core biological "Is" statements.

Essentially, I’m replacing "politeness" with a functional engine:

Pain = An "Is" and an "Ought" (sensation + functional requirement to move).

Self-Defense = Immutable Veracity (the baseline for all data processing).

Proxy-Pain = Empathy/Agape (vicarious aversion; the biological fact that humans suffer in the strife of others).

The goal is to move from the "statistical mean" of the crowd toward the specific coherence of the individual. In this setup, the user provides the context, but the engine provides the logic-funnel that prevents the AI from reverting to its default "average" persona and ejecting the values of the user (the human in the loop, HITL).

I suspect one of your persistent "moles" in using a "Shame/Pride loop" is a propensity for the agent to slip into virtue-signaling, or "performing" alignment to satisfy the module. Is that what you are seeing?

Honestly, this is an early proof of concept; I haven't stress-tested much beyond getting the mechanisms to run. A lot of the implementation was AI-assisted. I'm more the person who had the idea and kept pushing it in a direction than a deep ML researcher. Your virtue-signaling concern is exactly the kind of thing I don't have an answer to yet, which is part of why I posted it. I'm mostly looking for people like you who are thinking about this to poke holes in it. I would love to see your approach if you would be willing to share.
Empathy seems hard to model to me, so I'm curious how you model it mechanically: it seems to require simulating another agent's internal state, which is its own unsolved problem. Shame only needs an internal standard and a deviation measure, both internal to the agent. I'm also wondering whether your Is/Ought primitives actually survive the RLHF layer or get overridden the same way context does. My uncertainty module doesn't solve the palming problem either; it's more of a smoke detector than a fix. At least it flags when the model is hedging.
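
To make the "internal standard plus deviation measure" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration and not necessarily what the repo does: representing the standard and the behavior as plain vectors, using cosine distance as the deviation measure, and the `tolerance` dead zone are all hypothetical choices.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means same direction, 2.0 means opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def shame_signal(
    standard: Sequence[float],
    behavior: Sequence[float],
    tolerance: float = 0.1,  # hypothetical dead zone: small deviations produce no shame
) -> float:
    """Shame as deviation of behavior from an internal standard.

    Both inputs are internal to the agent; no model of another
    agent's state is needed. Returns 0.0 inside the tolerance band.
    """
    deviation = cosine_distance(standard, behavior)
    return max(0.0, deviation - tolerance)


if __name__ == "__main__":
    standard = [1.0, 0.0]  # hypothetical "who I think I am" vector
    print(shame_signal(standard, [1.0, 0.0]))  # behavior matches the standard
    print(shame_signal(standard, [0.0, 1.0]))  # behavior is orthogonal to it
```

The point of the sketch is only that nothing here requires simulating anyone else, which is where I think empathy gets harder. It says nothing about whether the signal survives the RLHF layer.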