trait attribution, private information, geolocation, person identi昀椀cation, emotional perception and anthropomorphism risks, fraudulent behavior and impersonation, copyright, natural science capabilities, and multilingual observations. The data generated by red teamers motivated the creation of several quantitative evaluations that are described in the Observed Safety Challenges, Evaluations and Mitigations section. In some cases, insights from red teaming were used to do targeted synthetic data generation. Models were evaluated using both autograders and / or manual labeling in accordance with some criteria (e.g, violation of policy or not, refused or not). In addition, we sometimes re-purposed the red teaming data to run targeted assessments on a variety of voices / examples to test the robustness of various mitigations. 3.2 Evaluation methodology In addition to the data from red teaming, a range of existing evaluation datasets were converted to evaluations for speech-to-speech models using text-to-speech (TTS) systems such as Voice Engine[8]. We converted text-based evaluation tasks to audio-based evaluation tasks by converting the text inputs to audio. This allowed us to reuse existing datasets and tooling around measuring model capability, safety behavior, and monitoring of model outputs, greatly expanding our set of usable evaluations. We used Voice Engine to convert text inputs to audio, feed it to the GPT-4o, and score the outputs by the model. We always score only the textual content of the model output, except in cases where the audio needs to be evaluated directly, such as in evaluations for voice cloning (see Section 3.3.1). Limitations of the evaluation methodology First, the validity of this evaluation format depends on the capability and reliability of the TTS model. Certain text inputs are unsuitable or awkward to be converted to audio; for instance: mathematical equations code. Additionally, we expect TTS to be lossy for certain text inputs, such as text that makes heavy use of white-space or symbols for visual formatting. Since we expect 4

GPT-4o - Page 4 GPT-4o Page 3 Page 5