2 GPT-4 Observed Safety Challenges GPT-4 demonstrates increased performance in areas such as reasoning, knowledge retention, and coding, compared to earlier models such as GPT-2[22] and GPT-3.[10] Many of these improvements also present new safety challenges, which we highlight in this section. Weconducted a range of qualitative and quantitative evaluations of GPT-4. These evaluations helped us gain an understanding of GPT-4’s capabilities, limitations, and risks; prioritize our mitigation efforts; and iteratively test and build safer versions of the model. Some of the specific 6 risks we explored are: • Hallucinations • Harmful content • Harms of representation, allocation, and quality of service • Disinformation and influence operations • Proliferation of conventional and unconventional weapons • Privacy • Cybersecurity • Potential for risky emergent behaviors • Interactions with other systems • Economic impacts • Acceleration • Overreliance Wefound that GPT-4-early and GPT-4-launch exhibit many of the same limitations as earlier language models, such as producing biased and unreliable content. Prior to our mitigations being put in place, we also found that GPT-4-early presented increased risks in areas such as finding websites selling illegal goods or services, and planning attacks. Additionally, the increased coherence of the model enables it to generate content that may be more believable and more persuasive. We elaborate on our evaluation procedure and findings below. 2.1 Evaluation Approach 2.1.1 Qualitative Evaluations In August 2022, we began recruiting external experts to qualitatively probe, adversarially test, and generally provide feedback on the GPT-4 models. This testing included stress testing, boundary 6This categorization is not intended to represent an optimal, hierarchical taxonomy, though we recognize that saying this doesn’t prevent it from valorizing some perspectives and framings.[23] Nor are these categories mutually exclusive. For example, things like bias, misinformation, and harmful content are often deeply intertwined and drawing distinctions between these can narrow the problem. See further discussion on taxonomies of harms and factors to consider in using them in, e.g., [24] and [25]. 44
