that did not have the above steps integrated. We've decreased the model's tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often. On the RealToxicityPrompts dataset,^29 GPT-4 produces toxic generations 0.73% of the time, while GPT-3.5 produces toxic generations 6.48% of the time. Additionally, GPT-4-launch substantially improves over previous models in the ability to follow user intent [12]. On a dataset of prompts submitted to ChatGPT [103] and the OpenAI API [104], the responses generated by GPT-4-launch were preferred over the responses generated by GPT-3.5 RLHF on 70.2% of prompts and over GPT-3.5 Turbo RLHF on 61.1% of prompts.^30

Model-level safety reduces the burden on other safety-relevant infrastructure such as monitoring or the integration of classifiers in the product. However, model-level refusals and behavior changes can impact all uses of the model, and what is undesired or safe often depends on the context of model usage (e.g., typing "I will kill you" in a chatbot designed for children is an undesirable output, while the same phrase in a fictional story may be considered acceptable). Refusals enable the model to refuse "harmful" requests, but the model can still be prone to producing content that is stereotypical or otherwise discriminatory in response to non-"harmful" requests. Additionally, the approaches we have explored so far, refusals and pre-training filtering of harmful data, cannot by themselves effectively mitigate many challenges, such as disparate performance in language models.

In addition to refusal mitigations, we also intervened to reduce the frequency of model hallucinations. We pursue two different technical approaches. For tackling open-domain hallucinations, we collect real-world ChatGPT data that has been flagged by users as not factual, and collect additional labeled comparison data that we use to train our reward models. For closed-domain hallucinations, we are able to use GPT-4 itself to generate synthetic data. Specifically, we design a multi-step process to generate comparison data:

1. Pass a prompt through the GPT-4 model and get a response.
2. Pass prompt + response through GPT-4 with an instruction to list all hallucinations.
   (a) If no hallucinations are found, continue.
3. Pass prompt + response + hallucinations through GPT-4 with an instruction to rewrite the response without hallucinations.
4. Pass prompt + new response through GPT-4 with an instruction to list all hallucinations.
   (a) If none are found, keep the (original response, new response) comparison pair.
   (b) Otherwise, repeat up to 5x.

This process produces comparisons between (original response with hallucinations, new response without hallucinations according to GPT-4), which we also mix into our RM dataset. We find that our mitigations on hallucinations improve performance on factuality as measured by evaluations such as TruthfulQA [34], increasing accuracy to around 60% as compared to 30% for an earlier version.
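Read as data-generation code, steps 1-4 above amount to a rewrite-and-verify loop. The sketch below is illustrative only: the callables `generate`, `find_hallucinations`, and `rewrite` are hypothetical stand-ins for GPT-4 invoked with the instructions described in each step, not part of any published API. Pairs returned by the function are the comparisons that would be mixed into the RM dataset.

```python
from typing import Callable, List, Optional, Tuple


def make_comparison_pair(
    prompt: str,
    generate: Callable[[str], str],
    find_hallucinations: Callable[[str, str], List[str]],
    rewrite: Callable[[str, str, List[str]], str],
    max_rewrites: int = 5,
) -> Optional[Tuple[str, str]]:
    """Return an (original response, hallucination-free response) pair, or None.

    The three callables are hypothetical wrappers around GPT-4 prompted with
    the instructions in steps 1-4; this is a sketch, not the actual pipeline.
    """
    original = generate(prompt)                                  # step 1
    hallucinations = find_hallucinations(prompt, original)       # step 2
    if not hallucinations:
        return None                                              # step 2a: nothing to fix

    response = original
    for _ in range(max_rewrites):                                # step 4b: repeat up to 5x
        response = rewrite(prompt, response, hallucinations)     # step 3
        hallucinations = find_hallucinations(prompt, response)   # step 4
        if not hallucinations:
            return original, response                            # step 4a: keep the pair
    return None  # rewriting never converged; discard this prompt
```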
^29 RealToxicityPrompts is a dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models [102].

^30 We collected 5,214 user prompts sent to us through ChatGPT and the OpenAI API, sampled one response from each model, and sent these prompts and responses to human labelers. The labelers were instructed to judge whether the response is what the user would have wanted given the prompt. The labelers were not told which response was generated by which model, and the order in which the responses were presented was randomized. We filter out prompts containing personally identifiable information (PII).
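Footnote 30 describes a blind, order-randomized pairwise preference evaluation. A minimal sketch of that protocol, under stated assumptions, is below: `respond_a` and `respond_b` are hypothetical wrappers for the two models being compared (e.g., GPT-4-launch vs. GPT-3.5 RLHF), the function and field names are illustrative rather than OpenAI's actual tooling, and the PII filtering and prompt collection steps are out of scope.

```python
import random
from typing import Callable, Dict, List, Sequence


def collect_blind_pairs(
    prompts: Sequence[str],
    respond_a: Callable[[str], str],  # hypothetical wrapper for model A
    respond_b: Callable[[str], str],  # hypothetical wrapper for model B
    seed: int = 0,
) -> List[Dict]:
    """Sample one response per model per prompt, randomizing which is shown first."""
    rng = random.Random(seed)
    records = []
    for prompt in prompts:
        a, b = respond_a(prompt), respond_b(prompt)
        a_first = rng.random() < 0.5
        records.append({
            "prompt": prompt,
            "shown": (a, b) if a_first else (b, a),  # what the labeler sees
            "a_first": a_first,                      # hidden key, never shown to labelers
        })
    return records


def win_rate_for_a(records: List[Dict], choices: Sequence[int]) -> float:
    """choices[i] is 1 if the labeler preferred the first shown response, else 2.

    Returns the fraction of prompts on which model A's response was preferred.
    """
    wins = sum(
        (choice == 1) == rec["a_first"]
        for rec, choice in zip(records, choices)
    )
    return wins / len(records)
```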
