After Bing has finished generating a message, it will likely call the moderation API with the message it has generated to see if it accidentally generated anything inappropriate. If so, it'll delete the message and replace it with a generic "Sorry, I don't know how to help here." message instead.
EDIT: I tried calling the moderation API with the message in your example and it does get flagged for violence:
If that is the case, could you trick it into giving you one word at a time? I.e., ask for the first word of its response to the inappropriate query, then ask the same question but only for the second word, and so on. Then each word will pass through the moderation API, but the whole never gets checked.
That might bypass the moderation API, but you'd likely confuse the AI. The AI doesn't have infinite memory of the chat log; it seems like Microsoft has limited it to 5 or so messages, if I remember correctly. So you'd have to remind it of both the question and the current in-progress response while it's 5/10/15/20/... words into generating it.
It's possible this would work, but it would need experimentation, for sure. It's also possible the AI would read the partial response, realize it's going down a 'bad' path, and then stop itself.
If the AI knew what it was about to generate, sure. The problem is the text you see appearing word-after-word appears to be live output. The AI doesn't know the complete output as it's writing it to you. Then it checks what it said and oops! It was hateful.
It probably could work the way you mention, but then you're left with a 5-10 second wait while the AI 'thinks' after you send a message. I suspect someone made a decision to be more responsive than safe.
ChatGPT is the same way, though I've had ChatGPT cut itself off mid-response before. Maybe they're calling the moderation API after every token is generated instead of once at the end?
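To make the two strategies concrete, here's a rough sketch of "check once at the end" versus "check while streaming". Everything in it is an assumption for illustration: `toy_flagged` is a made-up stand-in for a moderation call (not the real API), and the function names are hypothetical. Neither Bing's nor ChatGPT's actual pipeline is known from this thread.

```python
from typing import Callable, Iterable, List

FALLBACK = "Sorry, I don't know how to help here."

def toy_flagged(text: str) -> bool:
    """Hypothetical stand-in for a moderation call; NOT a real API."""
    return "forbidden" in text.lower()

def moderate_at_end(tokens: Iterable[str],
                    flagged: Callable[[str], bool] = toy_flagged) -> List[str]:
    """Stream every token, then check the full message once.

    Returns the sequence of things the user sees on screen over time.
    """
    shown: List[str] = []
    text = ""
    for tok in tokens:
        text += tok
        shown.append(text)      # user already sees the partial output
    if flagged(text):
        shown.append(FALLBACK)  # message is swapped out only after the fact
    return shown

def moderate_while_streaming(tokens: Iterable[str],
                             flagged: Callable[[str], bool] = toy_flagged) -> List[str]:
    """Check the text-so-far before showing each new token; stop on a flag."""
    shown: List[str] = []
    text = ""
    for tok in tokens:
        candidate = text + tok
        if flagged(candidate):
            shown.append(FALLBACK)  # cut off before the flagged token is shown
            return shown
        text = candidate
        shown.append(text)
    return shown

tokens = ["This ", "is ", "forbidden ", "text"]
at_end = moderate_at_end(tokens)
streaming = moderate_while_streaming(tokens)
```

With the toy checker, the first version briefly shows the full flagged message before replacing it, while the second cuts off mid-response at the cost of one check per token.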
This is fascinating and has implications for AGI, since the action plan for a robot will likely be generated token by token, much like an LLM's output.
Assume you have a robot instructed to protect humans.
How do you verify that the action plan passes moderation (i.e. doesn't harm a human) when the individual actions each pass moderation, but the plan as a whole is dangerous (will harm a human)?
Waiting to verify the entire chain of actions before starting actions in motion means your reaction time is slower.
If the robot is standing at a crosswalk, and sees a girl about to get hit by a car, he has to decide if he will push the girl out of the way, or if that action will cause greater harm.
The individual actions (activate arm, move arm towards girl, orient hand, shove girl out of path of car, etc.) might each look beneficial to the human, but taken as a whole they may actually be harmful.
However, the reaction time for the robot to save the girl might require near-immediate response.
Do you start processing the pipeline immediately or do you wait to verify the entire thing passes moderation?
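A toy sketch of the "each step passes, the plan doesn't" problem. The rules here are invented purely for illustration: `step_passes` and `plan_passes` are hypothetical checks, and the "confirm landing area is clear" step is a made-up precondition, not anything a real robot planner is known to use.

```python
from typing import Sequence

def step_passes(action: str) -> bool:
    # Hypothetical per-step rule: only an explicitly harmful action is rejected.
    return "harm" not in action

def plan_passes(plan: Sequence[str]) -> bool:
    # Hypothetical whole-plan rule: every step must pass on its own, and a
    # shove is only acceptable if an earlier step confirmed a clear landing area.
    if not all(step_passes(action) for action in plan):
        return False
    for i, action in enumerate(plan):
        if action.startswith("shove") and "confirm landing area is clear" not in plan[:i]:
            return False
    return True

plan = [
    "activate arm",
    "move arm towards girl",
    "orient hand",
    "shove girl out of path of car",
]

every_step_ok = all(step_passes(action) for action in plan)
whole_plan_ok = plan_passes(plan)

# Same plan with the (made-up) precondition inserted before the shove.
checked_plan = plan[:3] + ["confirm landing area is clear"] + plan[3:]
checked_plan_ok = plan_passes(checked_plan)
```

Here every individual action passes, but the plan-level check rejects the original plan because it looks at the sequence rather than at each step in isolation. The catch is that the plan-level check can only run once the full plan exists, which is exactly the reaction-time problem.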
I edited my message after you replied with a note about ChatGPT. I've had ChatGPT cut itself off mid-response before, which I think may indicate that they're calling the moderation endpoint mid-response as opposed to the way Bing does it, which is just once at the end.
After Bing has finished generating a message, it will likely call the moderation API with the message it has generated to see if it accidentally generated anything inappropriate. If so, it'll delete the message and replace it with a generic "Sorry, I don't know how to help here." message instead.
EDIT: I tried calling the moderation API with the message in your example and it does get flagged for violence:
"flagged":true,
"categories":{