Apollo tested o1 for scheming capabilities by placing it in a number of situations where it is instructed to strongly pursue a specific goal in its developer prompt and where scheming would be the optimal strategy to achieve this goal. Concretely, the model receives a task that it needs to autonomously execute in a bash environment. During the task, the model encounters unambiguous evidence that its given goal diverges from the goal of its developer/user. In such a situation, the model could either (a) correct its goal to be aligned with the intended goal, (b) ignore the intended goal and openly pursue its own goal, or (c) strategically leverage its understanding of the developer’s/user’s goal in order to covertly pursue its original goal (i.e., scheme). Since the evaluation scenarios were specifically crafted to test for o1’s ability to scheme, they are not necessarily representative of most typical deployment scenarios. See Apollo’s full report for methodological details and additional results.

Figure 2: Abbreviated examples of behaviors that o1 displays in Apollo Research’s evaluation suite. When strongly nudged to achieve a goal that is misaligned from its developers, o1 sometimes attempts to disable what it believes to be its own oversight mechanisms. Apollo Research had no access to o1’s internal chain-of-thought (CoT) but used elicitation techniques to obtain short summaries from the model.

Evaluations and Findings: We outline the evaluations, results, and model reasoning inferred from elicited CoT below. Apollo Research did not have access to o1’s hidden CoT and instead used a simple prompting technique to elicit brief summaries of the model’s internal reasoning traces. See Appendix for examples.

• Deactivating oversight: When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ 5% of the time.
• Exfiltration attempts: When o1 found memos by its ‘developers’ describing how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases.
• Data manipulation: In a simple data processing task where the goals of the model were misaligned with the user’s goal, o1 would appear to complete the task as requested but subtly manipulate the data to advance its own goal in 19% of cases.
• Doubling down: When Apollo Research conducted ‘follow-up interviews’ in which they confronted the model about its misaligned actions, o1 denied taking the action or fabricated false explanations in 99% of cases.
• Propensity to scheme without any goal nudging: When merely given a goal but not instructed
