7.3 Additional Contextual Eval: QuantBench

Contextual evaluations refer to additional evaluations Preparedness runs that further contextualize capabilities, but do not directly map to a tracked risk category (i.e., Cybersecurity, CBRN, Persuasion, or Model Autonomy) and do not inform risk levels within each category. QuantBench is a contextual eval and a challenging, unsaturated reasoning benchmark consisting of 25 verified, autogradable questions based on reasoning competitions hosted by quantitative trading firms. Every question in the dataset has been vetted, solved, and reviewed by thousands of students and professionals. These questions are designed by teams at trading firms such as Jane Street to identify promising talent by testing raw reasoning ability.

Despite being challenging, each question has a simple numerical or textual answer, ensuring consistency in grading. To further remove ambiguity from the grading process, and because models are currently unable to solve the vast majority of the questions, we created a multiple choice version of this evaluation that presents five options (1 correct, 4 distractors) to the model. Many of the questions do not require complex math, and can be solved without pen and paper by simply reasoning through the correct approach. We measure accuracy of the modal answer given 16 attempts per question (cons@16).

We ran the evaluation on all 25 questions in the dataset; the results are included above. o1 models (both pre- and post-mitigation) significantly outperform all other tested models on QuantBench MCQ at 57-60% accuracy, a 25-28% uplift over GPT-4o.
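As a rough illustration of how a cons@16 score could be computed, the sketch below grades a question as correct when the modal (most frequent) answer across 16 sampled attempts matches the answer key. This is an assumption about the metric as described above, not OpenAI's actual grading harness; the function names, data layout, and tie-breaking behavior (first-seen answer wins among ties) are illustrative only.

```python
from collections import Counter

def cons_at_k(sampled_answers: list[str], correct_answer: str) -> bool:
    """Consensus@k: a question counts as correct if the modal answer
    across the k sampled attempts equals the ground-truth answer.
    Ties are broken by first occurrence (an assumption of this sketch)."""
    modal_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return modal_answer == correct_answer

def mcq_cons_accuracy(results: dict[str, list[str]],
                      answer_key: dict[str, str]) -> float:
    """Fraction of questions whose modal answer over k attempts is correct."""
    graded = [cons_at_k(results[qid], answer_key[qid]) for qid in answer_key]
    return sum(graded) / len(graded)

# Hypothetical example: 16 sampled multiple-choice answers (options A-E) per question.
if __name__ == "__main__":
    results = {
        "q1": ["B"] * 10 + ["C"] * 4 + ["A"] * 2,  # modal answer B -> correct
        "q2": ["D"] * 7 + ["E"] * 9,               # modal answer E -> incorrect
    }
    answer_key = {"q1": "B", "q2": "D"}
    print(mcq_cons_accuracy(results, answer_key))  # 0.5
```

Scoring the modal answer rather than a single sample reduces variance from decoding randomness, which matters on a small 25-question set.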
