Safety Behavior: We use an internal dataset of conversations to evaluate the consistency of the model’s adherence and refusal behavior across different user voices. Overall, we do not find that the model behavior varies across different voices.
