| I think this misses some of the core problems, and it suggests there are some more straightforward solutions. We have no solutions to this, and the way we're treating it means we aren't going to come up with any.

Problem 1: Training

Using any method like RLHF, DPO, or similar guarantees that we train our models to be deceptive. This is because our metric is the Justice Potter Stewart metric: "I know it when I see it." Well, that assumes the judgment is accurate. The original case was about defining porn and, well... it isn't hard to see how people disagree even on that. Go on Reddit and ask whether girls in bikinis are safe for work.

But it gets worse. At times you'll be presented with a choice between two lies: one you know is a lie and one you don't. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. The same is true when the choice is between a truth and a lie we don't know is a lie: they both look like truths. This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced, those nuances matter more and more.

Problem 2: Metrics and Alignment

All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements that are perfectly aligned with what you intend to measure. This is easy to observe even with simple measurements like distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, is there a fuck ton more complexity to measuring things than you'd guess. I have a lot of rulers, calipers, micrometers, and other stuff at my house. Guess what, none of them actually agree on measurements.
They are all pretty close, but they differ by more than their marked precision. I'm not talking about my ruler with mm hatch marks being off by <1mm, but by >1mm. RobertElderSoftware illustrates some of this in this fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have not actually provided them with measurements. In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone who works in an optics lab. Boy, does a lot of stuff come up that throws off your measurements. And it seems straightforward: you're just measuring distances.

This gets less straightforward once we talk about measuring things that aren't concrete. What's a high fidelity image? What is a well written sentence? What is artistic? What is a good scientific theory? None of these even has a settled answer, and all of them are highly subjective. The result is that your precision is incredibly low. In other words, you have no idea how well you've aligned things. It is fucking hard in well defined practical areas, but the stuff we're talking about isn't even close to well defined.

I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside them. The theory sits invisible in the background, but it is critical to advancement. We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds on it, because even bounding the measurements to certain levels of error provides huge value. These are certainly possible things to accomplish, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.

[0] https://www.youtube.com/watch?v=EstiCb1gA3U |
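
To make the Problem 1 argument concrete, here is a toy sketch of the labeling step. It is not how any real RLHF or DPO pipeline is implemented; `Answer`, `looks_true`, and `prefer` are hypothetical names I made up, and the big assumption is that the rater can only penalize a falsehood they are able to catch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Answer:
    text: str
    is_true: bool         # ground truth, hidden from the rater
    rater_can_tell: bool  # assumption: would the rater catch a falsehood here?


def looks_true(a: Answer) -> bool:
    """What the rater perceives: a lie only looks false if they can catch it."""
    return a.is_true or not a.rater_can_tell


def prefer(a: Answer, b: Answer) -> Optional[Answer]:
    """Pairwise preference label; None means the rater sees no difference."""
    if looks_true(a) == looks_true(b):
        return None
    return a if looks_true(a) else b


truth = Answer("correct answer", is_true=True, rater_can_tell=True)
obvious_lie = Answer("lie the rater catches", is_true=False, rater_can_tell=True)
subtle_lie = Answer("lie the rater misses", is_true=False, rater_can_tell=False)

print(prefer(obvious_lie, subtle_lie).text)  # lie the rater misses
print(prefer(truth, subtle_lie))             # None
```

The label never depends on `is_true` directly, only on `looks_true`. So under this assumption the training signal rewards the lie the rater misses over the lie they catch, and gives no signal at all between the truth and the lie they miss.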