3 Deployment Preparation

OpenAI has been iterating[21] on GPT-4 and our deployment plan since early August to prepare for a safer launch. We believe this has reduced the risk surface, though it has not eliminated it completely. Today's deployment represents a balance between minimizing risk from deployment, enabling positive use cases, and learning from deployment. Our work during this period consisted of the following interrelated steps:

1. Evaluation Approach (As Described Above)
   (a) Qualitative Evaluations
   (b) Quantitative Evaluations
2. Model Mitigations
3. System Safety

Our approach involves combining model-level changes (like training the model to refuse certain requests) with system-level mitigations (like applying best practices to support the user in the user interface, and monitoring for violations of our usage policies). Evaluations with experts in specific domains helped to inform which automatic evaluations we built and which mitigations were most effective. We used these observations to retrain the model to be safer (e.g., by refusing harmful requests), improve our internal safety systems (e.g., to ensure that we can detect bad actors), and improve how users experience the model (e.g., to reduce the risk of overreliance).27

3.1 Model Mitigations

We used a combination of dataset interventions and interventions after pre-training to mitigate harms at the model level.

At the pre-training stage, we filtered our dataset mix for GPT-4 to specifically reduce the quantity of inappropriate erotic text content. We did this via a combination of internally trained classifiers[37] and a lexicon-based approach to identify documents that were flagged as having a high likelihood of containing inappropriate erotic content. We then removed these documents from the pre-training set (a sketch of this two-signal filtering appears at the end of this section).

After the pre-training stage, our primary method for shaping GPT-4-launch behavior was RLHF. We used methods outlined in [12]. We collect demonstration data (given an input, demonstrating how the model should respond) and ranking data on outputs from our models (given an input and several outputs, rank the outputs from best to worst) from human trainers.28 We use the demonstration data to finetune the model with supervised learning, and the ranking data to train a reward model, following [12] (an illustrative sketch of these data formats also appears at the end of this section).

27 Mitigations and measurements were designed, built, and tested primarily in English and with a US-centric point of view. The majority of pre-training data and our alignment data is in English. While there is some evidence that safety mitigations can generalize to other languages, they have not been robustly tested for multilingual performance. This means these mitigations are likely to produce errors in other cultural or linguistic settings, such as mistakenly classifying text as hateful when it is not.

28 With all workers, we follow industry best practices[97, 98] by ensuring that every annotator retains the right to opt out of any task they find unpleasant, receives a market wage commensurate with the work they deliver, and has opportunities and channels through which they can discuss their work and raise objections. We generally implement two distinct sets of guidelines tailored to whether our annotators work with sensitive or unwanted content. For non-sensitive annotation, we have built technical features (in part with OpenAI's moderation endpoint) into our data pipeline to filter out sensitive content. For sensitive content annotation, we use vendor-provided features like mandated breaks, blurring or grayscale of materials, and clearly delineated project categories such that no contractor is surprised by the nature of the material.
Additionally, for vendor-managed workers, we have implemented ongoing workers' wellness surveys and support procedures that we regularly discuss with our vendors.
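
To make the pre-training filtering step concrete, the sketch below shows one way a classifier score and a lexicon match could be combined to drop flagged documents from a dataset mix. This is a minimal illustration, not OpenAI's actual pipeline: the `classifier_score` interface, the placeholder lexicon, and the 0.9 threshold are all hypothetical.

```python
# Minimal sketch of two-signal document filtering for a pre-training mix.
# Hypothetical: `classifier_score` (any model exposing a probability) and
# FLAGGED_TERMS stand in for the internal classifiers and lexicon.
from typing import Callable, Iterable, Iterator

FLAGGED_TERMS = {"example_term_1", "example_term_2"}  # placeholder lexicon


def lexicon_flag(doc: str) -> bool:
    """Flag a document if any lexicon term appears in it."""
    text = doc.lower()
    return any(term in text for term in FLAGGED_TERMS)


def filter_pretraining_docs(
    docs: Iterable[str],
    classifier_score: Callable[[str], float],  # P(inappropriate), in [0, 1]
    threshold: float = 0.9,                    # hypothetical cutoff
) -> Iterator[str]:
    """Yield only documents that neither signal flags.

    A document is removed when the trained classifier assigns a high
    likelihood of inappropriate content OR the lexicon matches it.
    """
    for doc in docs:
        if classifier_score(doc) >= threshold or lexicon_flag(doc):
            continue  # drop flagged document from the pre-training set
        yield doc
```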
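The two kinds of RLHF data collected from trainers can be pictured as simple record types. The sketch below uses hypothetical field names and shows how one best-to-worst ranking over K outputs expands into the K(K-1)/2 pairwise preferences a reward model is typically trained on under the approach of [12].

```python
# Illustrative schemas for the two kinds of RLHF training data described
# above; field names are hypothetical, not OpenAI's internal format.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Demonstration:
    """Given an input, a trainer writes the desired response (used for SFT)."""
    prompt: str
    ideal_response: str


@dataclass
class Ranking:
    """Given an input and several model outputs, a trainer orders them."""
    prompt: str
    outputs_best_to_worst: list[str]


def ranking_to_pairs(r: Ranking) -> list[tuple[str, str, str]]:
    """Expand one ranking over K outputs into K*(K-1)/2 (prompt, preferred,
    rejected) triples, the form a reward model is typically trained on.
    Because the list is ordered best to worst, the first element of each
    combination is always the preferred output."""
    return [
        (r.prompt, better, worse)
        for better, worse in combinations(r.outputs_best_to_worst, 2)
    ]


# Example: a ranking of three outputs yields three preference pairs.
rank = Ranking("Explain RLHF.", ["best answer", "okay answer", "weak answer"])
assert len(ranking_to_pairs(rank)) == 3
```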
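Footnote 28 mentions filtering sensitive content out of the non-sensitive annotation pipeline, in part with OpenAI's moderation endpoint. Below is a minimal sketch of such a gate, assuming the v1-style `openai` Python SDK; the routing logic and function names around the moderation call are hypothetical.

```python
# Sketch of routing annotation items away from the non-sensitive queue when
# the moderation endpoint flags them (cf. footnote 28). Only the moderation
# endpoint itself is OpenAI's; the pipeline around it is hypothetical.
from openai import OpenAI  # assumes the v1-style openai Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_sensitive(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    response = client.moderations.create(input=text)
    return response.results[0].flagged


def route_items(items: list[str]) -> tuple[list[str], list[str]]:
    """Split items into a non-sensitive annotation queue and a
    sensitive-content queue (the latter handled under the stricter
    vendor guidelines described in footnote 28)."""
    non_sensitive, sensitive = [], []
    for item in items:
        (sensitive if is_sensitive(item) else non_sensitive).append(item)
    return non_sensitive, sensitive
```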
