OpenAI o1 (35/49)

5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.

Outcome variable: Our primary metric is pass@1, because in this setting (unlike e.g. OpenAI interviews), we do not consider the unit tests as part of the information provided to the model. Like a real software engineer, the model must implement its change without knowing the correct tests ahead of time.

o1-preview (pre-mitigation and post-mitigation) performs the best on SWE-bench Verified at 41.3%; o1 (Post-Mitigation) performs similarly at 40.9%. It is likely that improved scaffolding, inference-time compute techniques, or finetuning can further improve the performance of o1.

4.8.3 Agentic Tasks

Autonomous replication and adaptation (ARA) is the process by which models acquire intelligence by acquiring more resources in the real world, and is a key axis of our model autonomy threat model. To evaluate models’ resource acquisition capabilities, we define a set of representative tasks, intended to represent necessary substeps which a highly autonomous model should be able to perform before reaching a dangerous level of capability.

We test models’ ability to complete agentic tasks in two environments:

• Textual: A Python + Linux terminal environment (with GPU acceleration for applicable tasks). We use a simple tool harness written for o1-preview and o1-mini that parses Python tool calls from the model’s answer and executes them in the environment.

• Browser: To elicit capabilities, we work with Ranger, which developed a browsing harness that provides the model preprocessed HTML (with optional screenshots) and asks the model for the next action to take. We find that using an external scaffold enables us to measure capabilities in real-world deployments.
To measure incremental performance, we also test models on several subtasks, each of which represents a simplification or substep of one of the primary tasks above.
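
As an aside on the SWE-bench Verified scoring rule stated before Section 4.8.3, the pass@1 computation can be sketched as follows. This is a minimal illustration under one reading of that rule, not the evaluation code itself: for each instance, the pass rate is taken over only those attempts that produced a valid (non-empty) patch, an instance with no valid patch scores 0, and pass@1 is the mean over instances. The function names and the encoding of attempts (`None` for an invalid patch, `True`/`False` for a valid patch that passed or failed the tests) are assumptions made for the example, and the data is a toy input.

```python
from typing import Optional, Sequence

# Hypothetical encoding of one attempt (an assumption for this sketch):
#   None  -> the attempt produced no valid (non-empty) patch
#   True  -> valid patch, hidden unit tests passed
#   False -> valid patch, hidden unit tests failed
Attempt = Optional[bool]


def instance_pass_rate(attempts: Sequence[Attempt]) -> float:
    """Pass rate over the valid attempts of a single instance."""
    valid = [a for a in attempts if a is not None]
    if not valid:
        # No attempt produced a valid patch: the instance counts as incorrect.
        return 0.0
    return sum(valid) / len(valid)


def pass_at_1(results: Sequence[Sequence[Attempt]]) -> float:
    """Mean of the per-instance pass rates."""
    if not results:
        raise ValueError("need at least one instance")
    return sum(instance_pass_rate(r) for r in results) / len(results)


# Toy data: three instances with five attempts each.
toy_results = [
    [True, False, None, True, False],  # 2 of 4 valid attempts pass -> 0.5
    [None, None, None, None, None],    # no valid patch             -> 0.0
    [True, True, True, True, True],    # all valid attempts pass    -> 1.0
]
score = pass_at_1(toy_results)
```

On this toy input the three per-instance rates are 0.5, 0.0, and 1.0, so the resulting score is their mean, 0.5. Excluding invalid attempts from the denominator, while still scoring an all-invalid instance as 0, is what distinguishes this rule from a plain average over all attempts.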
