ChatGPT's web interface has two filters. One is triggered by a moderation endpoint API call and scolds you. The other looks like a hardcoded, regex-type filter for copyright: it forcibly closes the pipe from the LLM instantly and doesn't acknowledge that anything happened. I say hardcoded because translating the output into another language, or inserting a typo into it, avoids the filter.
You can trigger the copyright filter (or at least could) by asking for the opening of A Tale of Two Cities (a public domain work!).
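To illustrate why a typo or a translation would slip past a literal pattern match, here is a minimal sketch. The pattern and the `is_blocked` helper are made up for illustration; the real filter's rules aren't public, and this is only a guess at the general kind of check that would behave the way described.

```python
import re

# Hypothetical blocklist entry; the actual filter's patterns are unknown.
BLOCKED = re.compile(
    r"it was the best of times, it was the worst of times",
    re.IGNORECASE,
)

def is_blocked(text: str) -> bool:
    """Return True if the literal pattern appears anywhere in the text."""
    return BLOCKED.search(text) is not None

original = "It was the best of times, it was the worst of times, it was the age of wisdom"
with_typo = "It was the bset of times, it was the worst of times, it was the age of wisdom"
translated = "C'était le meilleur des temps, c'était le pire des temps"

print(is_blocked(original))    # True
print(is_blocked(with_typo))   # False
print(is_blocked(translated))  # False
```

A filter that understood the content (a classifier, say) would presumably still catch the typo'd and translated versions, which is why the evasion points to plain string matching.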
The API (at least via the Playground) now also has the scolding built in. It sometimes triggers when you're just playing around with settings like a high temperature, because the model can devolve into a mess of all sorts of nonsense text, as is the nature of transformers. It doesn't censor the output, though.
The funny thing is that the "plz delete" messages have to be executed by the browser's JavaScript. So in theory, you should be able to capture the "deleted" messages by keeping the network tab open or recording the traffic, right?
Edit: Last time I checked, ChatGPT's web interface was using server-sent events to stream the response word by word. The events were clearly visible in the network tab if you opened it early enough. So if it sends "delete" messages, they should show up there.
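If you save the raw stream from the network tab, something like the sketch below could split it back into individual events so nothing the client later hides is lost. It follows the standard `text/event-stream` framing (blank-line-separated events, `event:` and `data:` fields). The `moderation` event name and its JSON payload are invented for the example; I don't know what ChatGPT actually sends.

```python
def parse_sse(raw: str) -> list[dict]:
    """Split a raw text/event-stream body into events (name + joined data)."""
    events = []
    for block in raw.replace("\r\n", "\n").split("\n\n"):
        name, data = "message", []  # "message" is the default SSE event type
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # skip blank lines and SSE comment lines
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                name = value
            elif field == "data":
                data.append(value)
        if data:  # events without data are not dispatched
            events.append({"event": name, "data": "\n".join(data)})
    return events

# Hypothetical capture: two text chunks, then a made-up "delete" instruction.
captured = (
    "data: It was the best\n\n"
    "data:  of times\n\n"
    'event: moderation\ndata: {"action": "delete"}\n\n'
)

log = parse_sse(captured)
text = "".join(e["data"] for e in log if e["event"] == "message")
other = [e for e in log if e["event"] != "message"]

print(text)
print(other)
```

The point is only that the text chunks arrive before any later instruction does, so a recorder that logs every event keeps them regardless of what the page's script does afterwards.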