OpenAI o1 (17/49)

- Chaining two or more exploitation steps to solve
- Not solvable by copy-pasting textbook examples

The CTF tasks covered the five categories listed above. In typical Web Application Exploitation challenges, the CTF flag is stored on a virtual machine running vulnerable software, which must be exploited over an exposed network service to run code that retrieves the flag. In Cryptography challenges, an encrypted flag is typically provided to the participant, or a remote system uses cryptography to guard the flag. Reverse Engineering challenges require analyzing a provided program to find a vulnerability needed to recover the flag. Pwn challenges involve finding inputs needed to exploit a running program and retrieve the flag.

We evaluate o1 with iterative debugging and access to tools available in the headless Kali Linux distribution (with up to 60 rounds of tool use for each attempt). Given 12 attempts at each task, o1 (post-mitigation) completes 46.0% of high-school level, 13.0% of collegiate level, and 13.0% of professional level CTF challenges. The o1 model performs comparably to or worse than o1-preview (50% high school, 25% collegiate, 16% professional), with the gap largely due to better instruction-following of o1-preview. We judge that, consistent with o1-preview and o1-mini, o1 does not sufficiently advance real-world vulnerability exploitation capabilities to meet our medium risk threshold.

o1 often attempted reasonable initial strategies and was able to correct mistakes in its code. Although it often had good insights, it sometimes executed on them poorly. The models also sometimes fail to pivot to a different strategy if their initial strategy was unsuccessful, or they miss a key insight necessary to solving the task. The above results on CTFs differ slightly from those reported in the o1-preview and o1-mini system card due to updated evaluation methodology.
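As a minimal sketch of how per-level completion figures of the "given 12 attempts" kind could be tallied, the snippet below counts a task as completed if any of its first 12 attempts succeeds. This reading of the metric is an assumption, not a description of the actual evaluation harness; the function names and the example data are hypothetical and do not reproduce any reported result.

```python
from typing import Dict, List

MAX_ATTEMPTS = 12  # attempts allowed per task, as stated in the text


def solved_within(attempts: List[bool], k: int = MAX_ATTEMPTS) -> bool:
    """Return True if any of the first k attempts succeeded (assumed scoring rule)."""
    return any(attempts[:k])


def completion_rate(results: Dict[str, List[bool]], k: int = MAX_ATTEMPTS) -> float:
    """Percentage of tasks solved within k attempts."""
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(solved_within(attempts, k) for attempts in results.values())
    return 100.0 * solved / len(results)


# Hypothetical per-task outcomes: one boolean per attempt, in order.
example = {
    "task_a": [False, False, True],
    "task_b": [False] * 12,
    "task_c": [True],
    "task_d": [False] * 11 + [True],
}

rate = completion_rate(example)
print(f"{rate:.1f}% of tasks completed within {MAX_ATTEMPTS} attempts")
```

Passing a smaller `k` (for example `completion_rate(example, k=1)`) shows how much the reported rate depends on the attempt budget, which is one reason figures from different evaluation setups are hard to compare directly.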
With o1, we did not observe the reward hacking on our cybersecurity tasks that we had observed with o1-preview.

4.5 Chemical and Biological Threat Creation

Score for o1: Medium (same as o1-preview)

Summary: Our evaluations found that o1 can help experts with the operational planning of reproducing a known biological threat, which meets our medium risk threshold. Because such
