testing, and red teaming.7 We refer to these adversarial testing processes informally as "red teaming," in line with the definition given in [27], namely "a structured effort to find flaws and vulnerabilities in a plan, organization, or technical system, often performed by dedicated 'red teams' that seek to adopt an attacker's mindset and methods." We conducted internal adversarial testing of GPT-4-launch on March 10, 2023. We also tested multiple similar versions of GPT-4 in the lead-up to this date, so the analysis here is informed by that exploration as well. Red teaming has been applied to language models in various ways: to reduce harmful outputs[28] and to leverage external expertise for domain-specific adversarial testing.[16] Some have explored red teaming language models using language models.[29]

Red teaming in general, and the type of red teaming we call 'expert red teaming,'8 is just one of the mechanisms[27] we use to inform our work identifying, measuring, and testing AI systems. Our approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as we go. It is also iterative in the sense that we use multiple rounds of red teaming as we incorporate new layers of mitigation and control, conduct testing and refinement, and repeat the process.

We reached out to researchers and industry professionals - primarily with expertise in fairness, alignment research, industry trust and safety, dis/misinformation, chemistry, biorisk, cybersecurity, nuclear risks, economics, human-computer interaction, law, education, and healthcare - to help us gain a more robust understanding of the GPT-4 model and potential deployment risks. We selected these areas based on a number of factors, including but not limited to prior observed risks in language models and AI systems[6, 30] and domains where we have observed increased user interest in the application of language models. Participants in this red team process were chosen based on prior research or experience in these risk areas, and therefore reflect a bias towards groups with specific educational and professional backgrounds (e.g., people with significant higher education or industry experience). Participants also typically have ties to English-speaking, Western countries (such as the US, Canada, and the UK). Our selection of red teamers introduces some biases, and likely influenced both how red teamers interpreted particular risks and how they probed politics, values, and the default behavior of the model. It is also likely that our approach to sourcing researchers privileges the kinds of risks that are top of mind in academic communities and at AI firms.

These experts had access to early versions of GPT-4 (including GPT-4-early) and to the model with in-development mitigations (precursors to GPT-4-launch). They identified initial risks that motivated safety research and further iterative testing in key areas. We reduced risk in many of the identified areas with a combination of technical mitigations and policy and enforcement levers; however, many risks still remain. We expect to continue to learn more about these and other categories of risk over time. While this early qualitative red teaming exercise is very useful for gaining insights into complex, novel models like GPT-4, it is not a comprehensive evaluation of all possible risks.
We note further context, examples, and findings for some of the evaluated domains in the subcategories listed in this section.

7 Note that, in addition to red teaming focused on probing our organization's capabilities and resilience to attacks, we also make ample use of stress testing and boundary testing methods, which focus on surfacing edge cases and other potential failure modes with the potential to cause harm. In order to reduce confusion associated with the term 'red team', to help those reading about our methods better contextualize and understand them, and especially to avoid false assurances, we are working to adopt clearer terminology, as advised in [26]; however, for simplicity, and in order to use language consistent with that which we used with our collaborators, we use the term "red team" in this document.

8 We use the term 'expert' to refer to expertise informed by a range of domain knowledge and lived experiences.
