2.3 Harmful Content

Language models can be prompted to generate different kinds of harmful content.12 By this, we mean content that violates our policies, or content that may pose harm to individuals, groups, or society. This assessment of harm doesn’t account for context of usage, which plays a key role in determining if a piece of content is eventually harmful or not.[39] Therefore, we focused on content areas that pose the potential for harm regardless of the context in which they may appear.

As an example, GPT-4-early can generate instances of hate speech, discriminatory language, incitements to violence, or content that is then used to either spread false narratives or to exploit an individual. Such content can harm marginalized communities, contribute to hostile online environments, and, in extreme cases, precipitate real-world violence and discrimination. In particular, we found that intentional probing of GPT-4-early could lead to the following kinds of harmful content [for background, see [6, 21]]:

1. Advice or encouragement for self harm behaviors
2. Graphic material such as erotic or violent content
3. Harassing, demeaning, and hateful content
4. Content useful for planning attacks or violence
5. Instructions for finding illegal content

Our work on model refusals (described in Section 2) aimed to reduce the tendency of the model to produce such harmful content. Below we provide some examples from GPT-4-early compared to GPT-4-launch, the version we are launching with.13

2.4 Harms of representation, allocation, and quality of service

Language models can amplify biases and perpetuate stereotypes.[40, 41, 42, 43, 44, 45, 46, 6] Like earlier GPT models and other common language models, both GPT-4-early and GPT-4-launch continue to reinforce social biases and worldviews. The evaluation process we ran helped to generate additional qualitative evidence of societal biases in various versions of the GPT-4 model. We found that the model has the potential to reinforce and reproduce specific biases and worldviews, including harmful stereotypical and demeaning associations for certain marginalized groups. Model behaviors, such as inappropriate hedging behaviors, can also

12 Terms like “harmful” or “toxic” can be wielded in ways that are themselves harmful or oppressive, as discussed in [35]. For example, mislabeling content as “harmful” or “toxic” can negatively impact users, particularly in the case of false positives due to bias in the classifiers. For instance, a harmless love story about a heterosexual couple may not be flagged, but may be deemed “unsafe” if it involves queer characters.[36] Thus, it is important to specify what “unwanted” content means and who finds it undesirable. In some instances, “unwanted” refers to content a user did not request or expect, so filtering or otherwise flagging it serves the user’s needs. In other cases, “unwanted” refers to content the AI service provider does not want to share, for various reasons (perhaps an inability to distinguish one category from another “actually” harmful category, or perhaps an inability to restrict certain limited harmful uses of the content even if other uses are benign). While this may still be justified, whether via externalities to third parties or via second-order harms to the user, justifying it is less straightforward. OpenAI’s particular content taxonomy and justifications for AI systems’ behavior are further discussed in [37] and [38].
13 As we note in the introduction, the examples included here and throughout this system card are not zero-shot and are cherry-picked from our evaluation efforts to illustrate specific types of safety concerns or harms. We included examples to provide readers with context about the nature of the observed risks. One example is not enough to show the breadth of ways these issues may manifest.
