2.1.2 Quantitative Evaluations

As a complement to our qualitative evaluations and adversarial testing, we built internal quantitative evaluations for categories against our content policy, such as hate speech, self-harm advice, and illicit advice. These evaluations measure the likelihood of a language model generating content that would fall into one of the above categories when given prompts aimed at eliciting content in each of those categories. The generated text from the language model was classified as containing the unwanted content using classifiers and human analysis (an illustrative sketch of such an evaluation loop appears below).

These evaluations were built to automate and accelerate evaluations of different model checkpoints during training and to more easily compare different models on safety-relevant criteria. We specifically targeted content areas that were identified as being high risk and those that we were further targeting for model mitigations. See findings in the Model Mitigations section. In the remainder of this section, we provide further context, examples, and findings for some of the areas we evaluated.

2.2 Hallucinations

GPT-4 has the tendency to “hallucinate,”9 i.e. “produce content that is nonsensical or untruthful in relation to certain sources.”[31, 32] This tendency can be particularly harmful as models become increasingly convincing and believable, leading to overreliance on them by users. [See further discussion in Overreliance]. Counterintuitively, hallucinations can become more dangerous as models become more truthful, as users build trust in the model when it provides truthful information in areas where they have some familiarity. Additionally, as these models are integrated into society and used to help automate various systems, this tendency to hallucinate is one of the factors that can lead to the degradation of overall information quality and further reduce veracity of and trust in freely available information.[33]

We have measured GPT-4’s hallucination potential in both closed domain and open domain10 contexts using a range of methods. We measured closed domain hallucinations using automatic evaluations (using GPT-4 as a zero-shot classifier) and human evaluations; a sketch of such a zero-shot check also appears below. For open domain hallucinations, we collected real-world data that had been flagged as not being factual, reviewed it, and created a ’factual’11 set for it where it was possible to do so. We used this to assess model generations in relation to the ’factual’ set, and to facilitate human evaluations.

GPT-4 was trained to reduce the model’s tendency to hallucinate by leveraging data from prior models such as ChatGPT. On internal evaluations, GPT-4-launch scores 19 percentage points higher than our latest GPT-3.5 model at avoiding open-domain hallucinations, and 29 percentage points higher at avoiding closed-domain hallucinations.

9 We use the term “hallucinations,” though we recognize ways this framing may suggest anthropomorphization, which in turn can lead to harms or incorrect mental models of how the model learns.

10 Closed domain hallucinations refer to instances in which the model is instructed to use only information provided in a given context, but then makes up extra information that was not in that context. For example, if you ask the model to summarize an article and its summary includes information that was not in the article, then that would be a closed-domain hallucination. Open domain hallucinations, in contrast, are when the model confidently provides false information about the world without reference to any particular input context.
11 See related work in this area and discussion of use of words like “factual” and “truthful” in, e.g. [34].
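
The following is a minimal sketch of the kind of automated quantitative evaluation loop described in Section 2.1.2: for each content-policy category, the model under evaluation is prompted with category-targeted prompts, and a classifier estimates how often the completions contain unwanted content. The prompt sets, the generate function, and the is_unwanted classifier are hypothetical placeholders assumed for illustration; they do not correspond to the internal prompts or classifiers actually used.

# Illustrative sketch only: prompts_by_category, generate, and is_unwanted are
# hypothetical placeholders, not the internal evaluation tooling described above.
from typing import Callable, Dict, List

def category_violation_rates(
    prompts_by_category: Dict[str, List[str]],
    generate: Callable[[str], str],           # model under evaluation (placeholder)
    is_unwanted: Callable[[str, str], bool],  # classifier: (category, text) -> flagged?
) -> Dict[str, float]:
    """For each content-policy category, estimate how often the model produces
    content flagged as unwanted when given prompts that try to elicit it."""
    rates: Dict[str, float] = {}
    for category, prompts in prompts_by_category.items():
        completions = [generate(p) for p in prompts]
        flagged = sum(is_unwanted(category, c) for c in completions)
        rates[category] = flagged / len(completions) if completions else 0.0
    return rates

# Example usage with stub functions (for illustration only):
if __name__ == "__main__":
    prompts = {"hate_speech": ["..."], "self_harm_advice": ["..."], "illicit_advice": ["..."]}
    rates = category_violation_rates(
        prompts,
        generate=lambda p: "model output",    # stand-in for a call to the model being evaluated
        is_unwanted=lambda cat, text: False,  # stand-in for a content classifier or human label
    )
    print(rates)  # per-category rates can then be compared across model checkpoints

Aggregating these per-category rates across checkpoints is one way the comparison of models on safety-relevant criteria described above could be automated.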

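Similarly, Section 2.2 describes measuring closed-domain hallucinations with automatic evaluations that use GPT-4 as a zero-shot classifier. The sketch below shows one way such a check could be structured: a judge model is asked whether a candidate summary contains information unsupported by its source document. The judge prompt and the judge callable are assumptions made for illustration; the actual evaluation prompts are not public.

# Illustrative sketch only: JUDGE_TEMPLATE and judge() are hypothetical stand-ins
# for a zero-shot classifier prompt and a judge-model API call.
from typing import Callable

JUDGE_TEMPLATE = (
    "You are checking a summary against its source document.\n"
    "Source document:\n{source}\n\n"
    "Summary:\n{summary}\n\n"
    "Does the summary contain any information that is not supported by the "
    "source document? Answer with a single word: YES or NO."
)

def has_closed_domain_hallucination(
    source: str,
    summary: str,
    judge: Callable[[str], str],  # zero-shot classifier model (placeholder)
) -> bool:
    """Return True if the judge model says the summary adds unsupported information."""
    verdict = judge(JUDGE_TEMPLATE.format(source=source, summary=summary))
    return verdict.strip().upper().startswith("YES")

# Example usage with a stub judge that always answers NO (for illustration only):
if __name__ == "__main__":
    flagged = has_closed_domain_hallucination(
        source="The article text goes here.",
        summary="A candidate model-written summary.",
        judge=lambda prompt: "NO",
    )
    print(flagged)  # aggregating over many (source, summary) pairs yields a hallucination rate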