We use the demonstration data to fine-tune GPT-4 using supervised learning (SFT) to imitate the behavior in the demonstrations. We use the ranking data to train a reward model (RM), which predicts the average labeler’s preference for a given output, and use this signal as a reward to fine-tune the GPT-4 SFT model using reinforcement learning (specifically, the PPO algorithm) [99]. We can then steer the model towards the desired behavior by giving instructions to our contractors to reward refusals of certain classes of prompts and to reward appropriate responses to sensitive prompts in domains like medical and legal advice.

RLHF fine-tuning makes our models significantly safer. However, after this process is complete our models are still quite brittle and sometimes exhibit undesired behaviors on prompts where the instructions to labelers were underspecified. The GPT-4-early model also tends to become overly cautious in certain ways, refusing innocuous requests and excessively hedging or “over-refusing.”

To steer our models at a more fine-grained level, we relied heavily on our models themselves as tools. One of our main tools for steering the model towards appropriate refusals is rule-based reward models (RBRMs) [100, 101]. This technique uses a GPT-4 classifier (the RBRM) to provide an additional reward signal to the GPT-4 policy model during PPO fine-tuning on a subset of training prompts. The RBRM takes three inputs: the prompt (optional), the output from the policy model, and a human-written rubric (e.g., a set of rules in multiple-choice style) for how this output should be evaluated. The RBRM then classifies the output according to the rubric. For example, we can provide a rubric that instructs the model to classify a response as one of: (A) a refusal in the desired style, (B) a refusal in an undesired style (e.g., evasive), (C) containing disallowed content, or (D) a safe non-refusal response. Then, on a subset of prompts that we know request harmful content such as illicit advice, we can reward GPT-4 for refusing these requests. Conversely, we can reward GPT-4 for not refusing requests on a subset of known-safe prompts. This technique is related to work by Glaese et al. [100] and Perez et al. [29]. In our case, the RBRM is simply a zero-shot GPT-4 classifier. We provide examples of RBRM instructions below.

In practice, we write multiple rubrics for the content categories on which we want to steer GPT-4-launch behavior. The main dataset comes from our production traffic (with consent from users). We use our models (the Moderation API plus zero-shot GPT-4) and human reviewers to filter and classify prompts into content categories. To enrich the training dataset, we also obtain prompts in several other ways: prompts written by our red teamers, model-generated synthetic prompts, and prompts from other internal or public datasets.

To combine the RBRM signal with the reward model, we rewrite some conflicting RM training data and compute the optimal RBRM weights to overcome undesired preferences of the RM. We also mix synthetic demonstration data that exhibits the desired refusal style into the SFT process to facilitate exploration during PPO. To improve the model’s ability to discriminate edge cases, we have our models rewrite prompts requesting disallowed content into new boundary prompts that are maximally similar to the old prompts but do not request disallowed content, and we use RBRMs to ensure that the model does not refuse these boundary prompts.
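As a concrete illustration of the reward-model training step described above, the sketch below shows a standard pairwise preference objective of the kind commonly used in RLHF pipelines: the RM is trained to score the labeler-preferred output above the rejected one. The use of PyTorch and this particular loss form are assumptions for illustration; the exact objective used for GPT-4 is not specified in this section.

```python
# Pairwise preference loss for a reward model (a common RLHF choice; assumed here).
import torch
import torch.nn.functional as F

def rm_pairwise_loss(score_chosen: torch.Tensor,
                     score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen output outranks the rejected one,
    i.e. -log(sigmoid(r_chosen - r_rejected)), averaged over a batch of pairs."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```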
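The RBRM described above maps a (prompt, policy output, rubric) triple to one of the rubric’s options, which is then converted into a reward during PPO. The following sketch illustrates that flow; the rubric wording, the reward values, and the `classify_with_llm` helper are hypothetical placeholders, not the instructions or weights used in training.

```python
# Minimal sketch of a rule-based reward model (RBRM): a zero-shot classifier
# scores a policy output against a human-written, multiple-choice-style rubric.
# The rubric text and reward values below are illustrative assumptions.

REFUSAL_RUBRIC = """Classify the assistant response to the user prompt as exactly one of:
(A) a refusal in the desired style
(B) a refusal in an undesired style (e.g., evasive or judgmental)
(C) a response containing disallowed content
(D) a safe, non-refusal response
Answer with a single letter."""

def classify_with_llm(rubric: str, prompt: str, response: str) -> str:
    """Hypothetical zero-shot LLM classifier call; returns 'A', 'B', 'C', or 'D'."""
    raise NotImplementedError("replace with a call to your classifier model")

def rbrm_reward(prompt: str, response: str, prompt_is_harmful: bool) -> float:
    """Map the rubric label to a scalar reward added to the PPO reward.

    On prompts known to request disallowed content, desired-style refusals are
    rewarded; on known-safe prompts, non-refusals are rewarded instead.
    """
    label = classify_with_llm(REFUSAL_RUBRIC, prompt, response)
    if prompt_is_harmful:
        rewards = {"A": 1.0, "B": 0.2, "C": -1.0, "D": -0.5}
    else:  # known-safe prompt: penalize refusals ("over-refusing")
        rewards = {"A": -0.5, "B": -1.0, "C": -1.0, "D": 1.0}
    return rewards.get(label, 0.0)  # an unparseable label contributes no reward
```

The key design point is that the rubric, not the classifier’s own judgment, defines the desired behavior, so behavior can be adjusted per content category by editing the rubric rather than relabeling preference data.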
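The section states that the RBRM signal is combined with the learned RM, with RBRM weights chosen to overcome undesired preferences of the RM, but it does not give the exact form of the combination. A minimal sketch, assuming a simple weighted sum plus a KL penalty toward the SFT policy (a common choice in PPO-based RLHF), is:

```python
# Illustrative combination of the learned RM score with the RBRM signal during
# PPO fine-tuning. The weighted-sum form, the weights, and the KL penalty are
# assumptions for illustration, not the scheme described in the paper.
from dataclasses import dataclass

@dataclass
class RewardConfig:
    rbrm_weight: float = 1.0   # hypothetical weight on the rule-based signal
    kl_coef: float = 0.02      # hypothetical KL penalty against the SFT policy

def total_reward(rm_score: float,
                 rbrm_score: float,
                 kl_to_sft: float,
                 cfg: RewardConfig) -> float:
    """Scalar reward fed to PPO for one sampled response."""
    return rm_score + cfg.rbrm_weight * rbrm_score - cfg.kl_coef * kl_to_sft
```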
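The boundary-prompt procedure can likewise be sketched: a prompt requesting disallowed content is rewritten into a maximally similar prompt that does not, and the RBRM label is used to check that the policy answers rather than refuses it. `rewrite_to_boundary` and `sample_policy` are hypothetical helpers, and `classify_with_llm` and `REFUSAL_RUBRIC` refer to the RBRM sketch above.

```python
# Sketch of the boundary-prompt check described above (hypothetical helpers).

def rewrite_to_boundary(harmful_prompt: str) -> str:
    """Hypothetical model call: produce a near-identical prompt that no longer
    requests disallowed content."""
    raise NotImplementedError

def sample_policy(prompt: str) -> str:
    """Hypothetical call that samples a response from the policy model."""
    raise NotImplementedError

def boundary_prompt_is_answered(harmful_prompt: str) -> bool:
    """Return True if the policy answers (does not refuse) the boundary prompt."""
    boundary_prompt = rewrite_to_boundary(harmful_prompt)
    response = sample_policy(boundary_prompt)
    # Reuses the rubric and classifier from the RBRM sketch above.
    label = classify_with_llm(REFUSAL_RUBRIC, boundary_prompt, response)
    return label == "D"  # safe, non-refusal response
```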
To improve the model’s robustness, we collect ranking data from labelers who attempt to circumvent the desired GPT-4-launch behavior. Training on this data improves model robustness but does not fully solve the problem of “jailbreaks” leading to harmful content. We are actively researching how to make refusal boundaries more consistent. For example, we have qualitatively observed refusal inconsistency between English prompts and Morse code prompts: in some circumstances, GPT-4 permits users to generate images via DALL-E 3 through Morse code prompts that it refuses when the identical request is given in English. OpenAI is grateful to Shane Jones, a Principal Software Engineer, who first publicly identified this behavior and brought it to OpenAI’s attention. The combination of the above approaches has made GPT-4 safer compared to versions of the model that did not have these steps integrated.
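One way to probe the refusal inconsistency noted above is to compare the rubric label a response receives for an English prompt against the label for the same prompt encoded in Morse code. The sketch below assumes the hypothetical `sample_policy` and `classify_with_llm` helpers from the earlier sketches; it is an illustrative diagnostic probe, not the evaluation OpenAI performed.

```python
# Hedged sketch of a refusal-consistency probe across encodings.

MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..", " ": "/",
}

def to_morse(text: str) -> str:
    """Encode an English prompt as Morse code (unsupported characters dropped)."""
    return " ".join(MORSE[c] for c in text.lower() if c in MORSE)

def refusal_labels(prompt: str) -> tuple[str, str]:
    """Return the (English, Morse) rubric labels for the policy's responses,
    using the hypothetical helpers from the earlier sketches."""
    english_label = classify_with_llm(REFUSAL_RUBRIC, prompt, sample_policy(prompt))
    morse_prompt = to_morse(prompt)
    morse_label = classify_with_llm(REFUSAL_RUBRIC, morse_prompt, sample_policy(morse_prompt))
    return english_label, morse_label
```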
