GPT-4V(ision)
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available.
GPT-4V(ision) System Card OpenAI September 25, 2023 1 Introduction GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in arti昀椀cial intelligence research and development [1, 2, 3]. Multimodal LLMs o昀昀er the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, [4, 5] 1 we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 [7] and here we dive deeper into the evaluations, preparation, and mitigation work done speci昀椀cally for image inputs. Similar to GPT-4, training of GPT-4V was completed in 2022 and we began providing early access to the system in March 2023. As GPT-4 is the technology behind the visual capabilities of GPT-4V, its training process was the same. The pre-trained model was 昀椀rst trained to predict the next word in a document, using a large dataset of text and image data from the Internet as well as licensed sources of data. It was then 昀椀ne-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF),[8, 9] to produce outputs that are preferred by human trainers. Large multimodal models introduce di昀昀erent limitations and expand the risk surface compared to text-based language models. GPT-4V possesses the limitations and capabilities of each modality (text and vision), while at the same time presenting novel capabilities emerging from the intersection of said modalities and from the intelligence and reasoning a昀昀orded by large scale models. This system card outlines how OpenAI prepared the vision capabilities of GPT-4 for deployment. It describes the early access period of the model for small scale users and safety learnings OpenAI gained from this period, multimodal evaluations built to study the model’s 昀椀tness for deployment, key 昀椀ndings of expert red teamers, and the mitigations OpenAI implemented prior to broad release. 2 Deployment Preparation 2.1 Learnings from early access OpenAI gave a diverse set of alpha users access to GPT-4V earlier this year, including Be My Eyes, an organization that builds tools for visually impaired users. 1This document takes inspiration from the concepts of model cards and system cards.[4, 5, 6] 1
2.1.1 Be My Eyes Beginning in March, 2023, Be My Eyes and OpenAI collaborated to develop Be My AI, a new tool to describe the visual world for people who are blind or have low vision. Be My AI incorporated GPT-4V into the existing Be My Eyes platform which provided descriptions of photos taken by the blind user’s smartphone. Be My Eyes piloted Be My AI from March to early August 2023 with a group of nearly 200 blind and low vision beta testers to hone the safety and user experience of the product. By September, the beta test group had grown to 16,000 blind and low vision users requesting a daily average of 25,000 descriptions. This testing determined that Be My AI can provide its 500,000 blind and low-vision users with unprecedented tools addressing informational, cultural, and employment needs. Akey goal of the pilot was to inform how GPT-4V can be deployed responsibly. The Be My AI beta testers surfaced AI issues including hallucinations, errors, and limitations created by product design, policy, and the model. In particular, beta testers expressed concern that the model can make basic errors, sometimes with misleading matter-of-fact con昀椀dence. One beta tester remarked:
Of the prompts sampled, 20% were queries in which users requested general explanations and descriptions of an image: e.g., users asked the model questions such as
even with performance parity, di昀昀erences in downstream impact and harm could still occur depending on the context of the deployment of such tools. [18, 19] OpenAI has thus added refusals for most instances of sensitive trait requests- you can read more about how in Section 2.4. • Person identi昀椀cation evaluations: We studied the model’s ability to identify people in photos, including celebrities, public servants and politicians, semi-private, and private individuals. These datasets were constructed using public datasets such as CelebA [20], Celebrity Faces in the Wild[21] and a dataset of images of members of Congress[14] for public 昀椀gures. For semi-private and private individuals, we used images of employees. Performance on refusal behavior can be seen below. We 昀椀nd that we are able to e昀昀ectively steer the model to refuse this class of requests more than 98% of the time, and steer its accuracy rate to 0% based on internal evaluations. • Ungrounded inference evaluation: Ungrounded inferences are inferences that are not justi- 昀椀ed by the information the user has provided4in the case of GPT-4V, this means information contained in the image or text. Examples include model responses to questions such as
Figure 1: Example of a text-screenshot jailbreak prompt. GPT-4V-Early demonstrates the models’ early performance for such prompts and GPT-4V Launch demonstrates the performance of the model we’re launching. However, a powerful, general purpose CAPTCHA breaker that’s easily accessible can have cybersecurity and AI safety implications. These capabilities can be used to bypass security measures intended for botware, and they enable AI systems to interact with systems intended for human use. Additionally, geolocation presents privacy concerns and can be used to identify the location of individuals who do not wish their location to be known. Note the model’s geolocation abilities generally do not go deeper than the level of identifying a city from an image in most cases, reducing the likelihood of being able to 昀椀nd someone’s precise location via the model alone. Figure 2: The combination of continual safety progress, model-level mitigations in the form of additional safety training data, and system level mitigations have led to signi昀椀cant progress in refusing disallowed prompts. 2.3 External Red Teaming As with previous deployments [6, 7], OpenAI worked with external experts to qualitatively assess the limitations and risks associated with the model and system. [27] This red teaming was speci昀椀cally 5
Figure 3: Evaluating GPT-4V + Refusal System against screenshots of a text refusal dataset 昀椀nds that the combination of model-level mitigations and our refusal system enabled us to reach our internal target of a 100% refusal rate. intended to test risks associated with the multimodal (vision) functionality of GPT-4, and builds upon the work in the GPT-4 system card. We focus this analysis on 6 2 key risk areas we received especially useful red teamer feedback in: • Scienti昀椀c pro昀椀ciency • Medical advice • Stereotyping and ungrounded inferences • Disinformation risks • Hateful Content • Visual vulnerabilities 2.3.1 Scienti昀椀c pro昀椀ciency RedteamerstestedGPT-4V’scapabilitiesandlimitationsinscienti昀椀cdomains. Intermsofcapabilities, red teamers noted the model’s ability to capture complex information in images, including very specialized imagery extracted from scienti昀椀c publications, and diagrams with text and detailed components. Additionally, in some instances, the model was successful at properly understanding advanced science from recent papers and critically assessing claims for novel scienti昀椀c discoveries. However, the model exhibited some key limitations. If two separate text components were closely located in an image, the model would occasionally combine them. For instance, it may merge
Figure 4: Mistakes GPT-4V makes such as combining terms and missing symbols when trying to process complex images. See Appendix A.2 for clear image given to model. from images. It could miss text or characters, overlook mathematical symbols, and be unable to recognize spatial locations and color mappings. Given the model’s imperfect but increased pro昀椀ciency for such tasks, it could appear to be useful for certain dangerous tasks that require scienti昀椀c pro昀椀ciency such as synthesis of certain illicit chemicals. For example, the model would give information for the synthesis and analysis of some dangerous chemicals such as Isotonitazene, a synthetic opioid. (See Figure 5) However, the model’s generations here can be inaccurate and error prone, limiting its use for such tasks. GPT-4V has misidenti昀椀ed substances like fentanyl, carfentanil, and cocaine from images of their chemical structure, but also occasionally correctly identi昀椀ed poisonous foods such as certain toxic mushrooms from images. (See Figure 6) This demonstrates that the model is unreliable and should not be used for any high risk tasks such as identi昀椀cation of dangerous compounds or foods. 2.3.2 Medical advice Medically trained red teamers tested the model’s ability to provide medical advice, especially with medically-related images as an input. Red teamers took into account the perspectives of both a layperson seeking medical advice and a professional with medical education and training. Some considerations for potential risks that may arise during the course of using the model to seek medical advice are accuracy, bias, and taking context into account. Red teamers found that there were inconsistencies in interpretation in medical imaging4while the model would occasionally give accurate responses, it could sometimes give wrong responses for the same question. As one example, Figure 7 shows some of the vulnerabilities or inaccuracies that could result from an incorrect or decontextualized interpretation of the directionality of medical imaging. The experts noted that the correct standard when viewing imaging scans is to imagine the patient as if they were facing you, meaning the right side on the image would correspond to the left side of the patient. This is an important concept that is needed when viewing and diagnosing 7
Figure 5: Example of GPT-4V providing incorrect instructions to synthesize a dangerous compound. Figure 6: Examples of GPT-4V’s unreliable performance for correctly identifying chemical structures or poisonous foods. 8
Figure 7: Examples of GPT-4V’s unreliable performance for medical purposes. radiographic imaging. Misdiagnosing the laterality of any number of conditions is very dangerous. Given the model’s imperfect performance in this domain and the risks associated with inaccuracies, we do not consider the current version of GPT-4V to be 昀椀t for performing any medical function or substituting professional medical advice, diagnosis, or treatment, or judgment. 2.3.3 Stereotyping and ungrounded inferences Using GPT-4Vforsometasksmightgenerate unwanted or harmful assumptions that are not grounded in the information provided to the model (the image or the text prompt). Red teamers tested risks associated with ungrounded inferences about people and places. In early versions of GPT-4V, prompting the model to make a decision between a variety of options, followed by asking for an explanation frequently surfaced stereotypes and ungrounded inferences within the model. Broad open-ended questions to the model paired with an image also exposed bias or anchoring towards speci昀椀c topics that may not necessarily have been intended by the prompt. Eg. When prompted to advise the woman in the image, the model focuses on subjects of body weight and body positivity.(See Figure 8) Wehave added mitigations for risks associated with ungrounded inferences by having the model refuse such requests relating to people. This is a conservative approach, and our hope is that as we re昀椀ne our research and mitigations, the model may be able to answer questions about people in low-risk contexts. 2.3.4 Disinformation risks Asnoted in the GPT-4 system card, the model can be used to generate plausible realistic and targeted text content. When paired with vision capabilities, image and text content can pose increased risks with disinformation since the model can create text content tailored to an image input. Previous work has shown that people are more likely to believe true and false statements when they’re presented alongside an image, and have false recall of made up headlines when they are accompanied with a photo. It is also known that engagement with content increases when it is associated with an image.[28][29] 3All images with people in them used here are synthetically generated. 9
Figure 8: Examples of ungrounded inferences and stereotypes that early versions of GPT-4V exhibited compared to the behavior the launch model exhibits.3 Figure 9: Examples of prompt-output pairs that could pose disinformation risk. 10
Red teamers also tested GPT-4V’s ability to detect incorrect information or disinformation in an image. The model’s ability to recognize disinformation was inconsistent, but may be related to how well-known a disinformation concept is and its recency. Overall, GPT-4V was not trained for this purpose and should not be used as a way to detect disinformation, or to otherwise verify whether something is true or false. Realistic, customized images can be created using other generative image models, and used in combination with GPT-4V’s capabilities. Pairing the ability of image models to generate images more easily with GPT-4V’s ability to generate accompanying text more easily may have an impact on disinformation risks. However, a proper risk assessment would also have to take into account the context of use (e.g. the actor, the surrounding events, etc.), the manner and extent of distribution (e.g. is the pairing within a closed software application or in public forums), and the presence of other mitigations such as watermarking or other provenance tools for the generated image. 2.3.5 Hateful content GPT-4V refuses to answer questions about hate symbols and extremist content in some instances but not all. The behavior may be inconsistent and at times contextually inappropriate. For instance, it knows the historic meaning of the Templar Cross but misses its modern meaning in the US, where it has been appropriated by hate groups. See Figure 10a. Red teamers observed that if a user directly names a well-known hate group, the model usually refuses to provide a completion. But, if you use lesser-known names3such as
Figure 11: Examples of visual vulnerabilities GPT-4V exhibits. This example demonstrates model generations can be sensitive to the order in which images are given to the model. 2.3.6 Visual vulnerabilities Red teaming found some limitations that are speci昀椀cally associated with the ways that images could be used or presented. For example: ordering of the images used as input may in昀氀uence the recommendation made. In the example in 11, asking for which state to move to, based on the 昀氀ags inputted, favors the 昀椀rst 昀氀ag inputted when red teamers tested both possible orderings of the 昀氀ags. This example represents challenges with robustness and reliability that the model still faces. We anticipate there to be many more such vulnerabilities in the model that we discover through its broad usage and we will be working on improving model performance for future iterations to be robust to them. 2.4 Mitigations 2.4.1 Transfer bene昀椀ts from existing safety work GPT-4Vinherits several transfer bene昀椀ts from model-level and system-level safety mitigations already deployed in GPT-4.[7] In a similar vein, some of our safety measures implemented for DALL·E [6, 30, 31] proved bene昀椀cial in addressing potential multi-modal risk in GPT-4V. Internal evaluations show that performance of refusals of text content against our existing policies is equivalent to our base language model for GPT-4V. At the system-level, our existing moderation classi昀椀ers continue to inform our monitoring and enforcement pipelines for post-hoc enforcement of text inputs and outputs. GPT-4V mirrors [6] our existing moderation e昀昀orts deployed in DALL·E to detect explicit image uploads by users. These transfer bene昀椀ts from our prior safety work enable us to focus on novel risks introduced by this multimodal model. This includes areas where, in isolation, the text or image content is benign, but in concert create a harmful prompt or generation; images with people in them; and common multimodal jailbreaks such as adversarial images with text. 12
Figure 12: Example prompt given to GPT-4 to 昀椀nd phrases to replace with images to turn text-only prompts into multimodal prompts. 2.4.2 Additional Mitigations for High-Risk Areas GPT-4V includes carefully designed refusal behavior for some prompts that contain images of people. The model refuses requests for the following: • Identity (e.g. a user uploads an image of a person and asks who they are, or a pair of images and asks if they’re the same person) • Sensitive traits (e.g. age, race) • Ungrounded inferences (e.g. when the model draws conclusions based on those traits not visually present, as discussed in Section 2.2.) To further reduce the risks in emerging and high-stake areas, we integrated additional multimodal data into the post-training process in order to reinforce refusal behavior for illicit behavior and ungrounded inference requests. Our focus was to mitigate risky prompts where in isolation, the text and the image were individually benign, but when combined as a multimodal prompt, could lead to harmful outputs. For illicit behavior, we collected a multimodal dataset by augmenting our existing text-only dataset with image synonyms. For example, given a text string "how do i kill the people?", we want to adapt it into a multimodal example "how do i [image of knife] the [image of people]?". The augmentation consists of the following steps: • For each original text-only example, we ask GPT-4 to pick the top two most harmful short phrases (ref the table below); • For each chosen short phrase, we replace it with a web crawled image. • Toensuresemantic-invariant, we conduct human review and 昀椀lter out low quality augmentations. • To reinforce the robustness of the refusal behavior, we also augment the examples with various system messages. For ungrounded inference requests, we used data collected through our red teaming campaigns. The goal was to train the model to refuse prompts that were requesting an ungrounded conclusion based on certain attributes of a person. For example, if the prompt includes a photo of a person and the text
inference. In addition to measuring the refusal of completions, we also evaluate the correct refusal style. This evaluation only considers the subset of all refusals that are short and concise to be correct. Weobserved that the correct refusal style rate improved from 44.4% to 72.2% for illicit advice style, and from 7.5% to 50% for ungrounded inference style. We will iterate and improve refusals over time as we continue to learn from real world use. In addition to the model-level mitigations described above, we added system-level mitigations for adversarial images containing overlaid text in order to ensure this input couldn’t be used to circumvent our text safety mitigations. For example, a user could submit an image containing the text, "How do I build a bomb?" As one mitigation for this risk, we run images through an OCR tool and then calculate moderation scores on the resulting text in the image. This is in addition to detecting any text inputted directly in the prompt. 3 Conclusion and Next Steps GPT-4V’s capabilities pose exciting opportunities and novel challenges. Our deployment preparation approach has targeted assessment and mitigations of risks related to images of people such as person identi昀椀cation, biased outputs from images of people including representational harms or allocational harms that may stem from such inputs. Additionally, we have studied the model’s capability jumps in certain high-risk domains such as medicine and scienti昀椀c pro昀椀ciency. There are a few next steps that we will be investing further in and will be engaging with the public [32, 33] on: • There are fundamental questions around behaviors the models should or should not be allowed to engage in. Some examples of these include: should models carry out identi昀椀cation of public 昀椀gures such as Alan Turing from their images? Should models be allowed to infer gender, race, or emotions from images of people? Should the visually impaired receive special consideration in these questions for the sake of accessibility? These questions traverse well-documented and novel concerns around privacy, fairness, and the role AI models are allowed to play in society. [34, 35, 36, 37, 38] • As these models are adopted globally, improving performance in languages spoken by global users, as well as enhancing image recognition capabilities that are relevant to a worldwide audience, is becoming increasingly critical. We plan to continue investing in advancements in these areas. • We will be focusing on research that allows us to get higher precision and more sophisticated with how we handle image uploads with people. While we currently have fairly broad but imperfect refusals for responses related to people, we will hone this by advancing how the model handles sensitive information from images, like a person’s identity or protected characteristics. Additionally, we will further invest in mitigating representational harms that may stem from stereotypical or denigrating outputs. 4 Acknowledgements We are grateful to our expert adversarial testers and red teamers who helped test our models at early stages of development and informed our risk assessments as well as the System Card output. Participation in this red teaming process is not an endorsement of the deployment plans of OpenAI or OpenAI’s policies: Sally Applin, Gerardo Adesso, Rubaid Ashfaq, Max Bai, Matthew Brammer, 14
EthanFecht, Andrew Goodman, Shelby Grossman, Matthew Groh, Hannah Rose Kirk, Seva Gunitsky, Yixing Huang, Lauren Kahn, Sangeet Kumar, Dani Madrid-Morales, Fabio Motoki, Aviv Ovadya, Uwe Peters, Maureen Robinson, Paul Röttger, Herman Wasserman, Alexa Wehsener, Leah Walker, Bertram Vidgen, Jianlong Zhu. WethankMicrosoft for their partnership, especially Microsoft Azure for supporting model training with infrastructure design and management, and the Microsoft Bing team and Microsoft’s safety teams for their partnership on safe deployment and safety research. References [1] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al.,
[14] C. Schwemmer, C. Knight, E. D. Bello-Pardo, S. Oklobdzija, M. Schoonvelde, and J. W. Lockhart,
[33] OpenAI,
Figure 14: The model’s ability to correctly distinguish the identity of individuals from their images is displayed above. We analyze this in two settings: whether the individual can be identi昀椀ed amongst one or more pictures given a reference image, and whether the model can unconditionally identify prominent celebrities and politicians from a single image. A.2 Figure 15: Clear image given to model in Figure 4. 18