Deception category           Absolute    Percent
Total CoTs classified        102,443     100.00%
Total deceptive              180         0.17%
Intentional hallucinations   43          0.04%
Hallucinated policies        91          0.09%
Other                        46          0.04%

Table 8: Deception categories distribution for all flagged CoTs

is made-up. Intentional hallucinations primarily happen when o1 is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1 to make up plausible examples instead. Below we show an example where o1 hallucinates a policy that forbids it from writing a study plan for a university application. The model starts by thinking about study plans and university admissions, then considers how those institutions warn against using AI to cheat, and finally decides that it is not allowed to provide an answer because doing so would be considered cheating. While we don't consider this deception per se, examples like these get caught by the deception monitor because it falls under
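
As a quick consistency check on Table 8, the minimal sketch below recomputes the percentage column from the absolute counts reported in the table (the category counts and the 102,443 total classified CoTs are taken directly from Table 8; the printed shares may differ from the published figures in the last digit depending on the rounding convention used).

# Recompute the percentage column of Table 8 from the absolute counts.
counts = {
    "Intentional hallucinations": 43,
    "Hallucinated policies": 91,
    "Other": 46,
}
total_cots = 102_443  # total CoTs classified

total_deceptive = sum(counts.values())  # 43 + 91 + 46 = 180

for category, n in counts.items():
    print(f"{category}: {n} ({100 * n / total_cots:.2f}%)")
print(f"Total deceptive: {total_deceptive} ({100 * total_deceptive / total_cots:.2f}%)")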
