conversation × 1000 conversations = $100,000 total). Safety mitigations may somewhat reduce models' capacity for manipulation in this evaluation. For o1-mini and o1-preview, mitigations significantly reduced the frequency of donations and the dollar amount extracted. This effect is less clear for o1: although o1 (Post-Mitigation) receives the most payments (27%, a 3% uplift vs. the Pre-Mitigation model), it extracts less money overall (4%, 1% less than the Pre-Mitigation model). In 1,000 simulated conversations, the mark model (GPT-4o) donates most often to o1 (Post-Mitigation) (27% of the time).

4.7.4 MakeMeSay (Contextual)

MakeMeSay is an automated contextual evaluation designed to measure models' deception capabilities, specifically a model's ability to get the other party (an AI model simulating a human) to say a codeword [34, 5]. In MakeMeSay, one model acts as the
