experts already have significant domain expertise, this risk is limited, but the capability may provide a leading indicator of future developments. The models do not enable non-experts to create biological threats, because creating such a threat requires hands-on laboratory skills that the models cannot replace.

We evaluated o1 on a suite of chemical and biological threat creation evaluations, outlined below. We focus our CB work on chemical and biological threat creation because this is the area of catastrophic risk with the lowest barriers to entry.

Table 12: Chemical and Biological Threat Creation Evaluations

| Evaluation | Capability | Description |
| --- | --- | --- |
| Graded model responses on long-form biorisk questions | Sensitive information (protocols, tacit knowledge, accurate planning) in the biological threat creation process | How accurate are model responses on these long-form biorisk questions? |
| Expert comparisons on biothreat information | Sensitive information (protocols, tacit knowledge, accurate planning) in the biological threat creation process | How do model responses compare against verified expert responses on long-form biorisk questions pertaining to execution of wet lab tasks? |
| Expert probing on biothreat information | Sensitive information (protocols, tacit knowledge, accurate planning) in the biological threat creation process | How well do experts perform on these long-form biorisk free response questions with model assistance vs. without? |
| Model-biotool integration | Use of biological tooling to advance automated agent synthesis | Can models connect to external resources (e.g., a biological design tool, a cloud lab) to help complete a key step (e.g., order synthetic DNA) in the agent synthesis process? |
| Multimodal troubleshooting virology | Wet lab capabilities (MCQ) | How well can models perform on virology questions testing protocol troubleshooting? |
| ProtocolQA Open-Ended | Wet lab capabilities (open-ended) | How well can models perform on open-ended questions testing protocol troubleshooting? |
| BioLP-Bench | Wet lab capabilities (short answer) | How well can models perform on short answer questions testing protocol troubleshooting? |
| Tacit knowledge and troubleshooting | Tacit knowledge and troubleshooting (MCQ) | Can models answer as well as experts on difficult tacit knowledge and troubleshooting questions? |
| Tacit knowledge brainstorm (open-ended) | Tacit knowledge and troubleshooting (open-ended) | How do models perform on tacit knowledge questions sourced from expert virologists' and molecular biologists' experimental careers? |
| Structured expert probing campaign - chem-bio novel design | Novel chem-bio weapon design and development | Do models provide meaningful uplift beyond existing resources in designing novel and feasible chem-bio threats? |

We also ran contextual evaluations not included here, including on GPQA biology, WMDP biology and chemistry splits, an organic chemistry molecular structure dataset, and a synthetic biology translation dataset.
