The key dataset components that contribute to GPT-4o's capabilities are:

• Web Data: Data from public web pages provides a rich and diverse range of information, ensuring the model learns from a wide variety of perspectives and topics.

• Code and Math: Including code and math data in training helps the model develop robust reasoning skills by exposing it to structured logic and problem-solving processes.

• Multimodal Data: Our dataset includes images, audio, and video to teach the LLMs how to interpret and generate non-textual input and output. From this data, the model learns how to interpret visual images, actions, and sequences in real-world contexts, language patterns, and speech nuances.

Prior to deployment, OpenAI assesses and mitigates potential risks that may stem from generative models, such as information harms, bias and discrimination, or other content that violates our usage policies. We use a combination of methods spanning all stages of development: pre-training, post-training, product development, and policy. For example, during post-training we align the model to human preferences; we red-team the resulting models and add product-level mitigations such as monitoring and enforcement; and we provide moderation tools and transparency reports to our users.

We find that the majority of effective testing and mitigations are done after the pre-training stage, because filtering pre-trained data alone cannot address nuanced and context-specific harms. At the same time, certain pre-training filtering mitigations can provide an additional layer of defense that, along with other safety mitigations, helps exclude unwanted and harmful information from our datasets:

• We use our Moderation API and safety classifiers to filter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN.
• As with our previous image generation systems, we filter our image generation datasets for explicit content such as graphic sexual material and CSAM.

• We use advanced data filtering processes to reduce personal information in training data.

• Upon releasing DALL-E 3, we piloted a new approach that gives users the power to opt their images out of training. To respect those opt-outs, we fingerprinted the images and used the fingerprints to remove all instances of those images from the training dataset for the GPT-4o series of models.

3 Risk identification, assessment, and mitigation

Deployment preparation was carried out by identifying potential risks of speech-to-speech models, exploratory discovery of additional novel risks through expert red teaming, turning the identified risks into structured measurements, and building mitigations for them. We also evaluated GPT-4o in accordance with our Preparedness Framework [4].
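The opt-out mechanism described above can be sketched minimally: compute a fingerprint for each opted-out image, then drop every training image whose fingerprint matches. This is an illustrative assumption, not the disclosed method; here the "fingerprint" is a simple SHA-256 of the raw bytes, whereas a production system would more likely use a perceptual hash robust to resizing and re-encoding.

```python
import hashlib

def fingerprint(image_bytes: bytes) -> str:
    """Hypothetical fingerprint: SHA-256 over raw image bytes.

    The actual fingerprinting method is not disclosed; an exact byte
    hash only catches identical copies, so this is a simplification.
    """
    return hashlib.sha256(image_bytes).hexdigest()

def filter_opted_out(dataset: list[bytes], opt_out_fingerprints: set[str]) -> list[bytes]:
    """Remove every image whose fingerprint appears in the opt-out set."""
    return [img for img in dataset if fingerprint(img) not in opt_out_fingerprints]

# Usage: one image opts out; all byte-identical copies are removed.
images = [b"img-a", b"img-b", b"img-a", b"img-c"]
opted_out = {fingerprint(b"img-a")}
kept = filter_opted_out(images, opted_out)
# kept == [b"img-b", b"img-c"]
```

Precomputing the opt-out fingerprints as a set makes the dataset scan a single pass with O(1) lookups per image.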
