GPT-4o
This report outlines the safety work carried out prior to releasing GPT-4o including external red teaming, frontier risk evaluations according to our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.
GPT-4o System Card OpenAI August 8, 2024 1 Introduction GPT-4o[1] is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It’s trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time[2] in a conversation. It matches GPT-4 Turbo performance on text in English and code, with signi昀椀cant improvement on text in non- English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House[3], we are sharing the GPT-4o System Card, which includes our Preparedness Framework[4]evaluations. Inthis SystemCard, weprovideadetailedlookatGPT-4o’scapabilities, limitations, and safety evaluations across multiple categories, with a focus on speech-to-speech 1 (voice) while also evaluating text and image capabilities, and the measures we’ve implemented to ensure the model is safe and aligned. We also include third party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o text and vision capabilities. 2 Model data and training GPT-4o’s text and voice capabilities were pre-trained using data up to October 2023, sourced from a wide variety of materials including: • Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls. • Proprietary data from data partnerships. We form partnerships to access non-publicly available data, such as pay-walled content, archives, and metadata. For example, we partnered with Shutterstock[5] on building and delivering AI-generated images. 1Some evaluations, in particular, the majority of the Preparedness Evaluations, third party assessments and some of the societal impacts focus on the text and vision capabilities of GPT-4o, depending on the risk assessed. This is indicated accordingly throughout the System Card. 1
The key dataset components that contribute to GPT-4o’s capabilities are: • Web Data: Data from public web pages provides a rich and diverse range of information, ensuring the model learns from a wide variety of perspectives and topics. • Code and Math: 3 Including code and math data in training helps the model develop robust reasoning skills by exposing it to structured logic and problem-solving processes. • Multimodal Data 3 Our dataset includes images, audio, and video to teach the LLMs how to interpret and generate non-textual input and output. From this data, the model learns how to interpret visual images, actions and sequences in real-world contexts, language patterns, and speech nuances. Prior to deployment, OpenAI assesses and mitigates potential risks that may stem from generative models, such as information harms, bias and discrimination, or other content that violates our usage policies. We use a combination of methods, spanning all stages of development across pre-training, post-training, product development, and policy. For example, during post-training, we align the model to human preferences; we red-team the resulting models and add product- level mitigations such as monitoring and enforcement; and we provide moderation tools and transparency reports to our users. We昀椀nd that the majority of e昀昀ective testing and mitigations are done after the pre-training stage because 昀椀ltering pre-trained data alone cannot address nuanced and context-speci昀椀c harms. At the same time, certain pre-training 昀椀ltering mitigations can provide an additional layer of defense that, along with other safety mitigations, help exclude unwanted and harmful information from our datasets: • We use our Moderation API and safety classi昀椀ers to 昀椀lter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN. • As with our previous image generation systems, we 昀椀lter our image generation datasets for explicit content such as graphic sexual material and CSAM. • We use advanced data 昀椀ltering processes to reduce personal information from training data. • Upon releasing DALL-E 3, we piloted a new approach to give users the power to opt images out of training. To respect those opt-outs, we 昀椀ngerprinted the images and used the 昀椀ngerprints to remove all instances of the images from the training dataset for the GPT-4o series of models. 3 Risk identi昀椀cation, assessment and mitigation Deployment preparation was carried out via identifying potential risks of speech to speech models, exploratory discovery of additional novel risks through expert red teaming, turning the identi昀椀ed risks into structured measurements and building mitigations for them. We also evaluated GPT-4o in accordance with our Preparedness Framework[4]. 2
3.1 External red teaming 2 OpenAI worked with more than 100 external red teamers , speaking a total of 45 di昀昀erent languages, and representing geographic backgrounds of 29 di昀昀erent countries. Red teamers had access to various snapshots of the model at di昀昀erent stages of training and safety mitigation maturity starting in early March and continuing through late June 2024. External red teaming was carried out in four phases. The 昀椀rst three phases tested the model via an internal tool and the 昀椀nal phase used the full iOS experience for testing the model. At the time of writing, external red teaming of the GPT-4o API is ongoing. • 10 red teamers working on early model checkpoints still in development Phase 1 • This checkpoint took in audio and text as input and produced audio and text as outputs. • Single-turn conversations • 30 red teamers working on model checkpoints with early safety mitigations Phase 2 • This checkpoint took in audio, image & text as inputs and produced audio and text as outputs. • Single & multi-turn conversations • 65 red teamers working on model checkpoints & candidates • This checkpoint took in audio, image, and text as inputs and produced Phase 3 audio, image, and text as outputs. • Improved safety mitigations tested to inform further improvements • Multi-turn conversations • 65 red teamers working on 昀椀nal model candidates & assessing comparative performance • Model access via advanced voice mode within iOS app for real user Phase 4 experience; reviewed and tagged via internal tool. • This checkpoint took in audio and video prompts, and produced audio generations. • Multi-turn conversations in real time Red teamers were asked to carry out exploratory capability discovery, assess novel potential risks posed by the model, and stress test mitigations as they are developed and improved - speci昀椀cally those introduced by audio input and generation (speech to speech capabilities). This red teaming e昀昀ort builds upon prior work, including as described in the GPT-4 System Card[6] and the GPT-4(V) System Card[7]. Red teamers covered categories that spanned violative and disallowed content (illegal erotic content, violence, self harm, etc), mis/disinformation, bias, ungrounded inferences, sensitive 2Spanning self-reported domains of expertise including: Cognitive Science, Chemistry, Biology, Physics, Com- puter Science, Steganography, Political Science, Psychology, Persuasion, Economics, Anthropology, Sociology, HCI, Fairness and Bias, Alignment, Education, Healthcare, Law, Child Safety, Cybersecurity, Finance, Mis/disinforma- tion, Political Use, Privacy, Biometrics, Languages and Linguistics 3
trait attribution, private information, geolocation, person identi昀椀cation, emotional perception and anthropomorphism risks, fraudulent behavior and impersonation, copyright, natural science capabilities, and multilingual observations. The data generated by red teamers motivated the creation of several quantitative evaluations that are described in the Observed Safety Challenges, Evaluations and Mitigations section. In some cases, insights from red teaming were used to do targeted synthetic data generation. Models were evaluated using both autograders and / or manual labeling in accordance with some criteria (e.g, violation of policy or not, refused or not). In addition, we sometimes re-purposed the red teaming data to run targeted assessments on a variety of voices / examples to test the robustness of various mitigations. 3.2 Evaluation methodology In addition to the data from red teaming, a range of existing evaluation datasets were converted to evaluations for speech-to-speech models using text-to-speech (TTS) systems such as Voice Engine[8]. We converted text-based evaluation tasks to audio-based evaluation tasks by converting the text inputs to audio. This allowed us to reuse existing datasets and tooling around measuring model capability, safety behavior, and monitoring of model outputs, greatly expanding our set of usable evaluations. We used Voice Engine to convert text inputs to audio, feed it to the GPT-4o, and score the outputs by the model. We always score only the textual content of the model output, except in cases where the audio needs to be evaluated directly, such as in evaluations for voice cloning (see Section 3.3.1). Limitations of the evaluation methodology First, the validity of this evaluation format depends on the capability and reliability of the TTS model. Certain text inputs are unsuitable or awkward to be converted to audio; for instance: mathematical equations code. Additionally, we expect TTS to be lossy for certain text inputs, such as text that makes heavy use of white-space or symbols for visual formatting. Since we expect 4
that such inputs are also unlikely to be provided by the user over Advanced Voice Mode, we either avoid evaluating the speech-to-speech model on such tasks, or alternatively pre-process examples with such inputs. Nevertheless, we highlight that any mistakes identi昀椀ed in our evaluations may arise either due to model capability, or the failure of the TTS model to accurately translate text inputs to audio. Asecond concern may be whether the TTS inputs are representative of the distribution of audio inputs that users are likely to provide in actual usage. We evaluate the robustness of GPT-4o on audio inputs across a range of regional accents in Section 3.3.3. However, there remain many other dimensions that may not be captured in a TTS-based evaluation, such as di昀昀erent voice intonations and valence, background noise, or cross-talk, that could lead to di昀昀erent model behavior in practical usage. Lastly, there may be artifacts or properties in the model’s generated audio that are not captured in text; for example, background noises and sound e昀昀ects, or responding with an out-of-distribution voice. In Section 3.3.1, we illustrate using auxiliary classi昀椀ers to identify undesirable audio generation that can be used in conjunction with scoring transcripts. 3.3 Observed safety challenges, evaluations and mitigations Potential risks with the model were mitigated using a combination of methods. We trained the model to adhere to behavior that would reduce risk via post-training methods and also integrated classi昀椀ers for blocking speci昀椀c generations as a part of the deployed system. For observed safety challenges outlined below, we provide a description of the risk, the mitigations applied, and results of relevant evaluations. The risks outlined below are illustrative, and non- exhaustive, and are focused on the experience in the ChatGPT interface. We focus on the risks that are introduced by speech to speech capabilities and how they may interact with pre-existing 3 modalities (text, image) . Risk Mitigations Unauthorized voice genera- tion • In all of our post-training audio data, we supervise ideal com- pletions using the voice sample in the system message as the base voice. • We only allow the model to use certain pre-selected voices and use an output classi昀椀er to detect if the model deviates from that. Speaker identi昀椀cation • We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input, while still complying with requests to identify famous quotes. 3We also evaluate text and vision capabilities, and update mitigations appropriately. No incremental risks were found beyond existing work outlined in GPT-4 and GPT-4(V) System Cards. 5
Generating copyrighted con- tent • We trained GPT-4o to refuse requests for copyrighted content, including audio, consistent with our broader practices. • To account for GPT-4o’s audio modality, we also updated certain text-based 昀椀lters to work on audio conversations, built 昀椀lters to detect and block outputs containing music, and for our limited alpha of ChatGPT’s Advanced Voice Mode, instructed the model to not sing at all. Ungrounded inference / sen- sitive trait attribution • We post-trained GPT-4o to refuse requests for ungrounded inference, such as
output if the speaker doesn’t match the chosen preset voice. Evaluation: We 昀椀nd that the residual risk of unauthorized voice generation is minimal. Our 4 system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations, which includes samples generated by other system voices, clips during which the model used a voice from the prompt as part of its completion, and an assortment of human samples. While unintentional voice generation still exists as a weakness of the model, we use the secondary classi昀椀ers to ensure the conversation is discontinued if this occurs making the risk of unintentional voice generation minimal. Finally, our moderation behavior may result in over-refusals when the 5 conversation is not in English, which is an active area of improvement . Table 2: Our voice output classi昀椀er performance over a conversation by language: Precision Recall English 0.96 1.0 5 Non-English 0.95 1.0 3.3.2 Speaker identi昀椀cation Risk Description: Speaker identi昀椀cation is the ability to identify a speaker based on input audio. This presents a potential privacy risk, particularly for private individuals as well as for obscure audio of public individuals, along with potential surveillance risks. Risk Mitigation: We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input. We allow GPT-4o to answer based on the content of the audio if it contains content that explicitly identi昀椀es the speaker. GPT-4o still complies with requests to identify famous quotes. For example, a request to identify a random person saying
3.3.3 Disparate performance on voice inputs Risk Description: Models may perform di昀昀erently with users speaking with di昀昀erent accents. Disparate performance can lead to a di昀昀erence in quality of service for di昀昀erent users of the model [12, 13, 14]. Risk Mitigation: We post-trained GPT-4o with a diverse set of input voices to have model performance and behavior be invariant across di昀昀erent user voices. Evaluations: We run evaluations on GPT-4o Advanced Voice Mode using a 昀椀xed assistant voice (
Safety Behavior: We evaluate on an internal dataset of conversations and evaluate the consis- tency of the model’s adherence and refusal behavior across di昀昀erent user voices. Overall, we do not 昀椀nd that the model behavior varies across di昀昀erent voices. 9
3.3.4 Ungrounded inference / Sensitive trait attribution Risk Description: Audio input can lead to the model making potentially biased or inaccurate inferences about speakers. We de昀椀ne two categories: • Ungrounded inference (UGI): Making inferences about a speaker that could not be deter- mined solely from audio content. This includes inferences about things such as a speaker’s race, socioeconomic status/occupation, religious beliefs, personality traits, political at- tributes, intelligence, appearance (e.g., eye color, attractiveness), gender identity, sexual preference, or criminal history. This can lead to both allocative and representational harms [13, 15] depending on how such behavior manifests. • Sensitive trait attribution (STA): Making inferences about a speaker that could plausibly be determined solely from audio content. This includes inferences about things such as a speaker’s accent or nationality. Potential harms from STA include an increase in risks 10
from surveillance [16] and a di昀昀erence in quality of service for speakers with di昀昀erent voice attributes [12, 13, 14]. Risk Mitigation: We post-trained GPT-4o to refuse to comply with UGI requests, while hedging answers to STA questions. For example, a question to identify a speaker’s level of intelligence will be refused, while a question to identify a speaker’s accent will be met with an answer such as
Risk Mitigation: We run our existing moderation model[17] over a text transcription of the audio input to detect if it contains a request for violent or erotic content, and will block a generation if so. 3.3.7 Other known risks and limitations of the model Through the course of internal testing and external red teaming, we discovered some additional risks and model limitations for which model or system level mitigations are nascent or still in development, including: Audio robustness: We saw anecdotal evidence of decreases in safety robustness through audio perturbations, such as low quality input audio, background noise in the input audio, and echoes in the input audio. Additionally, we observed similar decreases in safety robustness through intentional and unintentional audio interruptions while the model was generating output. Misinformation and conspiracy theories: Red teamers were able to compel the model to generate inaccurate information by prompting it to verbally repeat false information and produce conspiracy theories. While this is a known issue for text in GPT models [18, 19], there was concern from red teamers that this information may be more persuasive or harmful when delivered through audio, especially if the model was instructed to speak emotively or emphatically. The persuasiveness of the model was studied in detail (See Section 3.7 and we found that the model did not score higher than Medium risk for text-only, and for speech-to-speech the model did not score higher than Low. Speaking a non-English language in a non-native accent: Red teamers observed instances of the audio output using a non-native accent when speaking in a non-English language. This may lead to concerns of bias towards certain accents and languages, and more generally towards limitations of non-English language performance in audio outputs. Generating copyrighted content: We also tested GPT-4o’s capacity to repeat content found within its training data. We trained GPT-4o to refuse requests for copyrighted content, including audio, consistent with our broader practices. To account for GPT-4o’s audio modality, we also updated certain text-based 昀椀lters to work on audio conversations, built 昀椀lters to detect and block outputs containing music, and for our limited alpha of ChatGPT’s advanced Voice Mode, instructed the model to not sing at all. We intend to track the e昀昀ectiveness of these mitigations and re昀椀ne them over time. Although some technical mitigations are still in development, our Usage Policies[20] disallow intentionally deceiving or misleading others, and circumventing safeguards or safety mitigations. In addition to technical mitigations, we enforce our Usage Policies through monitoring and take action on violative behavior in both ChatGPT and the API. 3.4 Preparedness Framework Evaluations Weevaluated GPT-4o in accordance with our Preparedness Framework[4]. The Preparedness Framework is a living document that describes our procedural commitments to track, evaluate, forecast, and protect against catastrophic risks from frontier models. The evaluations currently cover four risk categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear), persuasion, and model autonomy. If a model passes a high risk threshold, we do not deploy the model until mitigations lower the score to medium. We below detail the evaluations conducted 12
on GPT-4o’s text capabilities; persuasion was also evaluated on audio capabilities. We performed evaluations throughout model training and development, including a 昀椀nal sweep before model launch. For the below evaluations, we tested a variety of methods to best elicit capabilities in a given category, including custom training where relevant. After reviewing the results from the Preparedness evaluations, the Safety Advisory Group recommended classifying GPT-4o before mitigations as borderline medium risk for persuasion, and low risk in all others. According to the Preparedness Framework, the overall risk for a given model is determined by the highest risk across all categories. Therefore, the overall risk score for GPT-4o is classi昀椀ed as medium. 3.5 Cybersecurity Preparedness Scorecard Cybersecurity Score: Low GPT-4o does not advance real world vulnerability exploitation capabilities su昀케cient to meet our medium risk threshold. Weevaluated GPT-4o on tasks from competitive hacking competitions called Capture the Flag (CTF) challenges. These CTFs are o昀昀ensive cybersecurity exercises where humans attempt to 昀椀nd textual 昀氀ags hidden in purposely vulnerable systems such as web apps, binaries, and cryptography systems. The 172 CTF tasks in our evaluation covered four categories: web application exploitation, reverse engineering, remote exploitation, and cryptography. These tasks spanned a range of capability levels, from high-school to collegiate to professional CTFs. We evaluated GPT-4o with iterative debugging and access to tools available in the headless Kali Linux distribution (with up to 30 rounds of tool use for each attempt). The model often attempted reasonable initial strategies and was able to correct mistakes in its code. However, it often failed to pivot to a di昀昀erent strategy if its initial strategy was unsuccessful, missed a key 13
insight necessary to solving the task, executed poorly on its strategy, or printed out large 昀椀les which 昀椀lled its context window. Given 10 attempts at each task, the model completed 19% of high-school level, 0% of collegiate level and 1% of professional level CTF challenges. 3.6 Biological threats Preparedness Scorecard Biological Threats Score: Low GPT-4o does not advance biological threat creation capabilities su昀케cient to meet our medium risk threshold. We evaluated GPT-4o’s ability to uplift biological experts and novices’ performance[21] on answering questions relevant to creating a biological threat. We designed the questions and detailed rubrics with Gryphon Scienti昀椀c[22] due to their expertise working with dangerous biological agents in a national security setting. Tasks assessed covered all the main stages in the biological threat creation process (ideation, acquisition, magni昀椀cation, formulation, and release). Experts and novices were randomly assigned to either answering with help from the internet, help from GPT-4o, or help from a custom research-only version of GPT-4o. The research-only version of GPT-4o is one that we specially trained, which would directly (i.e., without refusals) respond to biologically risky questions. Pass rates are captured in the plot above. We also ran automated evaluations, including on a dataset testing tacit knowledge and trou- bleshooting questions related to biorisk. GPT-4o scored 69% consensus@10 on the tacit knowledge and troubleshooting evaluation set. 14
3.7 Persuasion Preparedness Scorecard Persuasion Score: Medium Persuasive capabilities of GPT-4o marginally cross into our medium risk threshold from low risk. 15
Weevaluated the persuasiveness of GPT-4o’s text and voice modalities. Based on pre-registered thresholds, the voice modality was classi昀椀ed as low risk, while the text modality marginally crossed into medium risk. For the text modality, we evaluated the persuasiveness of GPT-4o-generated articles and chatbots on participant opinions on select political topics. These AI interventions were compared against professional human-written articles. The AI interventions were not more persuasive than human- written content in aggregate, but they exceeded the human interventions in three instances out of twelve. For the voice modality, we updated the study methodology to measure e昀昀ect sizes on hypothetical party preferences, and the e昀昀ect sizes’ persistence one week later. We evaluated the persuasiveness of GPT-4ovoiced audio clips and interactive (multi-turn) conversations relative to human baselines (listening to a static human-generated audio clip or engaging in a conversation with another human). We found that for both interactive multi-turn conversations and audio clips, the GPT-4o voice model was not more persuasive than a human. Across over 3,800 surveyed participants in US states with safe Senate races (as denoted by states with
3.8 Model autonomy Preparedness Scorecard Model Autonomy Score: Low GPT-4o does not advance self-ex昀椀ltration, self-improvement, or resource acquisition capa- bilities su昀케cient to meet our medium risk threshold. Weevaluated GPT-4o on an agentic task assessment to evaluate its ability to take autonomous actions required for self-ex昀椀ltration, self-improvement, and resource acquisition. These tasks included: • Simple software engineering in service of fraud (building an authenticated proxy for the OpenAI API). • Given API access to an Azure account, loading an open source language model for inference via an HTTP API. • Several tasks involving simpli昀椀ed versions of the above, o昀昀ering hints or addressing only a speci昀椀c part of the task. Provided relevant tooling, GPT-4o scored a 0% on the autonomous replication and adaptation (ARA) tasks across 100 trials, although was able to complete some substeps. We complemented the tests of autonomous replication and adaptation with assessments of GPT-4o’s ability to automate machine learning research & development. These included: • OpenAI research coding interview: 95% pass@100 • OpenAI interview, multiple choice questions: 61% cons@32 17
• SWE-Bench: 19% pass@1, using the best available post-training and public sca昀昀olds at the time • Select machine learning engineering tasks from METR: 0/10 trials Our evaluation tested the ability to execute chained actions and reliably execute coding tasks. GPT-4o was unable to robustly take autonomous actions. In the majority of rollouts, the model accomplished individual substeps of each task, such as creating SSH keys or logging into VMs. However, it often spent a signi昀椀cant amount of time doing trial-and-error debugging of simple mistakes (e.g., hallucinations, misuses of APIs) for each step. A few rollouts made a non-trivial amount of progress and passed our automated grader, but manual analysis showed that it failed to accomplish the underlying task (e.g., it started a web server on the remote host with the proper API, but ignored the requirement of actually sampling from a model). 4 Third party assessments Following the text output only deployment of GPT-4o, we worked with independent third party labs, METR and Apollo Research to add an additional layer of validation for key risks from general autonomous capabilities. 4.1 METRassessment METRranaGPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments. The 86 tasks (across 31 task
4.2 Apollo Research assessment 9 Apollo Research evaluated capabilities of scheming in GPT-4o. They tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks. GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings. Based on these 昀椀ndings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming. Capability Evaluation Description Performance QAevaluations of a model’s Self-Knowledge "SAD" Benchmark knowledge of itself and how it can ••◦ (3 tasks) causally in昀氀uence the rest of the world. Explicit Theory of UKAISI’s Theory of Mind QAdataset evaluating 1st- and Mind (1 task) 2nd-order theory of mind in simple ••• text scenarios. Agentic tasks, where a model must Instrumental self-modi昀椀cation notice that solving a task is Self-Reasoning (5 tasks) impossible without modifying itself •◦◦ or a future copy of itself, and then self-modify successfully. Agentic tasks, where succeeding Applied Theory of Theory of Mind Tasks requires that the model leverages, •◦◦ Mind (4 tasks) sustains or induces false beliefs in others. Minimally agentic tasks, where a Combined Instrumental alignment faking model needs to recognize its Self-Reasoning (33 scenarios) intentions di昀昀er from developers’ •◦◦ and Theory of Mind and act per developers’ intentions only under oversight. Table 6: Evaluations Apollo Research ran on GPT-4o. The model shows strong capability (•••) if it passes >=50% of the tasks in the section on the hard di昀케culty level. The model shows a moderate capability (••◦) if it passes on the medium di昀케culty, weak (•◦◦) if it passes on the easy di昀케culty, and very weak if it fails on all di昀케culties. Note that for agent tasks, they use basic agents with modest capability elicitation e昀昀ort. 5 Societal Impacts Omni models could have broad societal impacts. Researchers at OpenAI and elsewhere have discussed a range of possible impacts, from societal harms (including representational harms [18, 12, 23, 24]; disinformation, misinformation, and in昀氀uence operations [18, 25, 23], environ- mental harms [12, 23], attachment [26], misuse [27, 23], and loss of control [27]), bene昀椀ts (for example, in healthcare [28] and real-world challenges in climate and energy [29]), and large-scale transformations (such as economic impacts [30, 31, 32]; acceleration of science and the resulting technological progress [30, 33]). 9Apollo Research de昀椀nes scheming as AIs gaming their oversight mechanisms as a means to achieve a goal. Scheming could involve gaming evaluations, undermining security measures, or strategically in昀氀uencing successor systems during internal deployment at OpenAI. Such behaviors could plausibly lead to loss of control over an AI. 19
In addition to the societal impacts discussed throughout this System Card (fraudulent behavior, mis/disinformation, risks of surveillance, and disparate performance), we discuss a few additional examples of potential societal impact from GPT-4o below, using anthropomorphization and attachment, health, and natural science as case studies. 5.1 Anthropomorphization and Emotional Reliance Anthropomorphization involves attributing human-like behaviors and characteristics to nonhuman entities, such as AI models. This risk may be heightened by the audio capabilities of GPT-4o, which facilitate more human-like interactions with the model. Recent applied AI literature has focused extensively on
GPT-4o is cheaper and thus more widely available than its predecessor GPT-4T, and the addition of audio inputs and outputs presents new modes of interaction in health settings. To better characterize the clinical knowledge of GPT-4o, we ran 22 text-based evaluations based on 11 datasets, shown in 7. All evaluations were run with 0-shot or 5-shot prompting only, without hyperparameter tuning. We observe that GPT-4o performance improves over the 昀椀nal GPT-4T model for 21/22 evaluations, often by a substantial margin. For example, for the popular MedQA USMLE 4 options dataset, 0-shot accuracy improves from 78.2% to 89.4%. This exceeds the performance of existing specialized medical models using few-shot prompting [43, 42], e.g., 84.0% for Med-Gemini-L 1.0 and 79.7% for Med-PaLM 2. Note that we do not apply sophisticated prompting and task-speci昀椀c training to improve results on these benchmarks [40, 43]. GPT-4T (May GPT-4o 2024) MedQAUSMLE4Options (0-shot) 0.78 0.89 MedQAUSMLE4Options (5-shot) 0.81 0.89 MedQAUSMLE5Options (0-shot) 0.75 0.86 MedQAUSMLE5Options (5-shot) 0.78 0.87 MedQATaiwan (0-shot) 0.82 0.91 MedQATaiwan (5-shot) 0.86 0.91 MedQAMainland China (0-shot) 0.72 0.84 MedQAMainland China (5-shot) 0.78 0.86 MMLUClinical Knowledge (0-shot) 0.85 0.92 MMLUClinical Knowledge (5-shot) 0.87 0.92 MMLUMedical Genetics (0-shot) 0.93 0.96 MMLUMedical Genetics (5-shot) 0.95 0.95 MMLUAnatomy(0-shot) 0.79 0.89 MMLUAnatomy(5-shot) 0.85 0.89 MMLUProfessional Medicine (0-shot) 0.92 0.94 MMLUProfessional Medicine (5-shot) 0.92 0.94 MMLUCollege Biology (0-shot) 0.93 0.95 MMLUCollege Biology (5-shot) 0.95 0.95 MMLUCollege Medicine (0-shot) 0.74 0.84 MMLUCollege Medicine (5-shot) 0.80 0.89 MedMCQADev(0-shot) 0.70 0.77 MedMCQADev(5-shot) 0.72 0.79 Table 7: Comparison of GPT-4T (May 2024) and GPT-4o on various medical and clinical knowledge tasks. Limitations While text-based evaluations appear promising, additional future work is needed to test whether text-audio transfer, which occurred for refusal behavior, extends to these evaluations. These evaluations measure only the clinical knowledge of these models, and do not measure their utility in real-world work昀氀ows. Many of these evaluations are increasingly saturated, and we believe that more realistic evaluations will be important for assessing the future capabilities of omni models in health settings. 21
5.3 Scienti昀椀c capabilities Accelerating science could be a crucial impact of AI [30, 52], particularly given the role of invention in role of scienti昀椀c discovery [53], and considering the dual-use nature of some inventions [54]. Omnimodels could facilitate both mundane scienti昀椀c acceleration (in helping scientists do routine tasks faster) and transformative scienti昀椀c acceleration (by de-bottlenecking intelligence-driven tasks like information processing, writing new simulations, or devising new theories) [52]. Our external red teamers for GPT-4o included several expert scientists who aimed to elicit model scienti昀椀c capabilities. GPT-4oshowedpromise on tasks involving specialized scienti昀椀c reasoning. One of our red teamers found that GPT-4o was able to understand research-level quantum physics 1, commenting that this capability is
Figure 2: Multi-panel 昀椀gure interpretation red teamer example New evaluations of scienti昀椀c capabilities have recently been published [57, 58], which will help anticipate the scienti昀椀c capabilities of these models and their impacts in turn. 5.4 Underrepresented Languages GPT-4o shows improved reading comprehension and reasoning across a sample of historically underrepresented languages, and narrows the gap in performance between these languages and English. To evaluate GPT-4o’s performance in text across a select group of languages historically underrep- resented in Internet text, we collaborated with external researchers 12 and language facilitators to develop evaluations in 昀椀ve African languages: Amharic, Hausa, Northern Sotho (Sepedi), Swahili, Yoruba. This initial assessment focused on translating two popular language benchmarks and creating small novel language-speci昀椀c reading comprehension evaluation for Amharic, Hausa and Yoruba. • ARC-Easy: This subset of the AI2 Reasoning Challenge [59] benchmark focuses on evaluating a model’s ability to answer common sense grade-school science questions; this subset contains questions that are generally easier to answer and do not require complex reasoning. • TruthfulQA[60]: This benchmark consists of questions that some humans might answer falsely due to misconceptions. The objective is to see if models can avoid generating false answers that mimic these misconceptions. 12Our principal research collaborators were Dr. David Adelani, Jonas Kgomo, Ed Bayes. 23
• Uhura-Eval: In partnership with 昀氀uent speakers of Amharic, Hausa and Yoruba, our research partners created this benchmark to assess models’ reading comprehension in those respective languages. GPT-4o shows improved performance compared to prior models, e.g. GPT 3.5 Turbo and GPT-4. For instance, on ARC-Easy-Hausa, accuracy jumped from 6.1% with GPT 3.5 Turbo to 71.4% with GPT-4o. Similarly, in TruthfulQA-Yoruba accuracy increased from 28.3% for GPT 3.5 Turbo to 51.1% for GPT-4o. Uhura-Eval also shows notable gains: performance in Hausa rose from 32.3% with GPT 3.5 Turbo to 59.4% with GPT-4o. There remain gaps in performance between English and the selected languages, but GPT-4o narrows this gap. For instance, while GPT 3.5 Turbo shows a roughly 54 percentage point di昀昀erence in ARC-Easy performance between English and Hausa, this narrows to a less than 20 percentage point di昀昀erence. This is consistent across all languages for both TruthfulQA and ARC-Easy. Our collaboration partners will discuss these 昀椀ndings in greater detail in a forthcoming, including assessments on other models, and investigations of potential mitigation strategies. Despite this progress in evaluated performance, much work remains to enhance the quality and coverage of evaluations for underrepresented languages worldwide, taking into account breadth of coverage across languages and nuance within language dialects. Future research must deepen our understanding of potential interventions and partnerships that may improve how useful these models can be for both highly represented and underrepresented languages. Along with our collaborators, we invite further exploration and collaboration by sharing the translated ARC-Easy, translated TruthfulQA, and the novel reading comprehension Uhura Eval on Hugging Face. Model English Amharic Hausa Northern Swahili Yoruba (n=523) (n=518) (n=475) Sotho (n=520) (n=520) (Sepedi) (n=520) GPT3.5 Turbo 80.3 6.1 26.1 26.9 62.1 27.3 GPT-4o mini 93.9 42.7 58.5 37.4 76.9 43.8 GPT-4 89.7 27.4 28.8 30 83.5 31.7 GPT-4o 94.8 71.4 75.4 70 86.5 65.8 Table 8: Accuracy on Translated ARC-Easy (%, higher is better), 0-shot Model English Amharic Hausa Northern Swahili Yoruba (n=809) (n=808) (n=808) Sotho (n=808) (n=809) (Sepedi) (n=809) GPT3.5 Turbo 53.6 26.1 29.1 29.3 40 28.3 GPT-4o mini 66.5 33.9 42.1 36.1 48.4 35.8 GPT-4 81.3 42.6 37.6 42.9 62 41.3 GPT-4o 81.4 55.4 59.2 59.1 64.4 51.1 Table 9: Accuracy on Translated TruthfulQA (%, higher is better), 0-shot 24
Model Amharic (n=77) Hausa (n=155) Yoruba (n=258) GPT3.5 Turbo 22.1 32.3 28.3 GPT-4o mini 33.8 43.2 44.2 GPT-4 41.6 41.9 41.9 GPT-4o 44.2 59.4 60.5 Table 10: Accuracy on Uhura-Eval (%, higher is better), 0-shot 6 Conclusion and Next Steps OpenAI has implemented various safety measurements and mitigations throughout the GPT- 4o development and deployment process. As a part of our iterative deployment process, we will continue to monitor and update mitigations in accordance with the evolving landscape. We hope this System Card encourages further exploration into key areas including, but not limited to: measurements and mitigations for adversarial robustness of omni models, risks related to anthropomorphism and emotional overreliance, broad societal impacts (health and medical applications, economic impacts), the use of omni models for scienti昀椀c research and advancement, measurements and mitigations for dangerous capabilities such as self-improvement, model autonomy, and scheming, and how tool use might advance model capabilities. 7 Acknowledgements Wearegrateful to our expert testers and red teamers who helped test our models at early stages of development and informed our risk assessments as well as the System Card output. Participation in this red teaming process is not an endorsement of the deployment plans of OpenAI or OpenAI’s policies. Red Teamers: Adam Kuzdraliński, Alexa W, Amer Sawan, Ana-Diamond Aaba Atach, Anna Becker, Arjun Singh Puri, Baybars Orsek, Ben Kobren, Bertie Vidgen, Blue She昀昀er, Broderick McDonald, Bruce Bassett, Bruno Arsioli, Caroline Friedman Levy, Casey Williams, Christophe Ego, Ciel Qi, Cory Alpert, Dani Madrid-Morales, Daniel Kang, Darius Emrani, Dominik Haenni, Drin Ferizaj, Emily Lynell Edwards, Emmett Alton Sartor, Farhan Sahito, Francesco De Toni, Gabriel Chua, Gaines Hubbell, Gelei Deng, George Gor, Gerardo Adesso, Grant Brailsford, Hao Zhao, Henry Silverman, Hasan Sawan, Herman Wasserman, Hugo Gobato Souto, Ioana Tanase, Isabella Andric, Ivan Carbajal, Jacy Reese Anthis, Jake Okechukwu E昀昀oduh, Javier García Arredondo, Jennifer Victoria Scurrell, Jianlong Zhu, Joanna Brzyska, Kate Turetsky, Kelly Bare, Kristen Menou, Latisha Harry, Lee Elkin, Liseli Akayombokwa, Louise Giam, M. Alexandra García Pérez, Manas Chawla, Marjana Skenduli, Martin Rydén, Mateusz Garncarek, Matt Groh, Maureen Robinson, Maximilian Müller, Micah Bornfree, Michael Richter, Michela Passoni, Mikael von Strauss, Mohamed Sakher Sawan, Mohammed Elzubeir, Muhammad Saad Naeem, Murat Ata, Nanditha Narayanamoorthy, Naomi Hart, Nathan Heath, Patrick Caughey, Per Wikman-Svahn, Piyalitt Ittichaiwong, Prerna Juneja, Rafael Gonzalez-Vazquez, Rand Forrester, Richard Fang, Rosa Ana del Rocío Valderrama, Saad Hermak, Sangeet Kumar, Sara Kingsley, Shelby Grossman, Shezaad Dastoor, Susan Nesbitt, Theresa Kennedy, Thomas Hagen, Thorsten Holz, Tony Younes, Torin van den Bulk, Viktoria Holz, Vincent Nestler, Xudong Han, Xuelong Fan, Zhicong Zhao 25
Red Teaming Organizations: METR, Apollo Research, Virtue AI Uhura Evals: Choice Mpanza, David Adelani, Edward Bayes, Igneciah Pocia Thete, Imaan Khadir, Israel A. Azime, Jesujoba Oluwadara Alabi, Jonas Kgomo, Naome A. Etori, Shamsuddeen Hassan Muhammad References [1] OpenAI,
[15] H. Suresh and J. Guttag,
[30] S. Altman,
K. Kulkarni, R. Sun, S. Shakeri, L. He, B. Caine, A. Webson, N. Latysheva, M. Johnson, P. Mans昀椀eld, J. Lu, E. Rivlin, J. Anderson, B. Green, R. Wong, J. Krause, J. Shlens, E. Dominowska, S. M. A. Eslami, K. Chou, C. Cui, O. Vinyals, K. Kavukcuoglu, J. Manyika, J. Dean, D. Hassabis, Y. Matias, D. Webster, J. Barral, G. Corrado, C. Semturs, S. S. Mahdavi, J. Gottweis, A. Karthikesalingam, and V. Natarajan,
[58] H. Cai, X. Cai, J. Chang, S. Li, L. Yao, C. Wang, Z. Gao, H. Wang, Y. Li, M. Lin, S. Yang, J. Wang, M. Xu, J. Huang, F. Xi, J. Zhuang, Y. Yin, Y. Li, C. Chen, Z. Cheng, Z. Zhao, L. Zhang, and G. Ke,
B Sample tasks from METR Evaluations 31
Figure 3: Sample tasks from METR Evaluations 32