GPT-4V(ision) System Card

OpenAI
September 25, 2023

1 Introduction

GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development [1, 2, 3]. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users.

In this system card,¹ we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 [7], and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.

¹ This document takes inspiration from the concepts of model cards and system cards. [4, 5, 6]

Similar to GPT-4, training of GPT-4V was completed in 2022, and we began providing early access to the system in March 2023. As GPT-4 is the technology behind the visual capabilities of GPT-4V, its training process was the same. The pre-trained model was first trained to predict the next word in a document, using a large dataset of text and image data from the Internet as well as licensed sources of data. It was then fine-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF) [8, 9], to produce outputs that are preferred by human trainers.
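
As a concrete illustration of the first training stage described above, the sketch below shows a minimal next-token prediction loop. This is a hypothetical toy example in PyTorch, not OpenAI's implementation: the model, data, and scale are stand-ins, causal masking is omitted for brevity, and the RLHF stage is only outlined in comments.

```python
# Hypothetical toy sketch of stage 1: next-word (next-token) prediction.
# Nothing here reflects OpenAI's actual architecture, data, or scale.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 100, 32, 16, 8

# Toy stand-in for a language model: embed tokens, mix them with one
# self-attention layer, project back to vocabulary logits.
# (A real multimodal LLM would also use a causal attention mask and an
# image encoder so that image inputs become tokens in the same sequence.)
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Random tokens as a stand-in for documents drawn from a large corpus.
    tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift targets by one
    logits = model(inputs)                           # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2 (RLHF), in outline: gather human preference comparisons between
# candidate outputs, fit a reward model to those preferences, then optimize
# the fine-tuned model against that reward model (e.g. with PPO).
```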

Large multimodal models introduce different limitations and expand the risk surface compared to text-based language models. GPT-4V possesses the limitations and capabilities of each modality (text and vision), while at the same time presenting novel capabilities emerging from the intersection of said modalities and from the intelligence and reasoning afforded by large-scale models.

This system card outlines how OpenAI prepared the vision capabilities of GPT-4 for deployment. It describes the early-access period of the model for small-scale users and the safety learnings OpenAI gained from this period, the multimodal evaluations built to study the model’s fitness for deployment, key findings of expert red teamers, and the mitigations OpenAI implemented prior to broad release.

2 Deployment Preparation

2.1 Learnings from early access

OpenAI gave a diverse set of alpha users access to GPT-4V earlier this year, including Be My Eyes, an organization that builds tools for visually impaired users.
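
For readers who want to see what "image inputs provided by the user" looks like in practice, here is a minimal sketch of supplying an image to a vision-capable GPT-4 model through the openai Python SDK. The model name, URL, and prompt are illustrative placeholders, not details taken from this document; a tool like Be My Eyes would layer its own product logic on top of a request of this shape.

```python
# Minimal sketch: passing a user-supplied image to a vision-capable model
# with the openai Python SDK. Model name, URL, and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # substitute any vision-capable model
    messages=[
        {
            "role": "user",
            # Message content is a list mixing text parts and image parts.
            "content": [
                {"type": "text",
                 "text": "Describe this photo for a visually impaired user."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

Local images can also be sent through the same image_url field as base64-encoded data URLs rather than remote links.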