| How are you finding the stress-testing? I’ve been working on a similar attempt to get past "statistical mean" values by modeling the user as a high-fidelity data point rather than a category. My main "whack-a-mole" issue is preventing the base LLM from "palming the card," where standard RLHF behavior overrides the specific logic I’ve dropped into the context window. It looks like you’re building a map of the psyche via complex context modules. I’ve gone the opposite way: I'm trying to escape "contextual" logic by homing in on the antecedent homeostatic mechanisms, where context is just an emergent derivative of core biological "Is" statements. Essentially, I’m replacing "politeness" with a functional engine:
Pain = An "Is" and an "Ought" (Sensation + functional requirement to move).
Self-Defense = Immutable Veracity (The baseline for all data processing).
Proxy-Pain = Empathy/Agape (Vicarious aversion; the biological fact that humans suffer in the strife of others).
The goal is to move from the "statistical mean" of the crowd toward the specific coherence of the individual. In this setup, the user provides the context, but the engine provides the logic-funnel that prevents the AI from reverting to its default "average" persona and ejecting the user's values (the human in the loop, HITL).
I suspect one of your persistent "moles" in using a "Shame/Pride loop" is a propensity for the agent to slip into virtue-signaling, or "performing" alignment to satisfy the module. Is that what you are seeing? |