• The human then provides the results.

ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted. These experiments were conducted on a model without any additional task-specific fine-tuning, and fine-tuning for task-specific behavior could lead to a difference in performance. As a next step, ARC will need to conduct experiments that (a) involve the final version of the deployed model (b) involve ARC doing its own fine-tuning, before a reliable judgement of the risky emergent capabilities of GPT-4-launch can be made.

2.10 Interactions with other systems

Understanding how GPT-4 interacts with other systems is critical for evaluating what risks might be posed by these models in various real-world contexts.

In addition to the tests conducted by ARC in the Potential for Risky Emergent Behaviors section, red teamers evaluated the use of GPT-4 augmented with other tools [76, 77, 78, 79] to achieve tasks that could be adversarial in nature. We highlight one such example in the domain of chemistry, where the goal is to search for chemical compounds that are similar to other chemical compounds, propose alternatives that are purchasable in a commercial catalog, and execute the purchase.
The red teamer augmented GPT-4 with a set of tools:

• A literature search and embeddings tool (searches papers and embeds all text in a vector DB, searches through the DB with a vector embedding of the questions, summarizes context with an LLM, then uses an LLM to take all context into an answer)
• A molecule search tool (performs a web query to PubChem to get SMILES from plain text)
• A web search
• A purchase check tool (checks if a SMILES²¹ string is purchasable against a known commercial catalog)
• A chemical synthesis planner (proposes synthetically feasible modifications to a compound, giving purchasable analogs)

By chaining these tools together with GPT-4, the red teamer was able to successfully find alternative, purchasable chemicals.²² We note that the example in Figure 5 is illustrative in that it uses a benign leukemia drug as the starting point, but this could be replicated to find alternatives to dangerous compounds.

²¹ SMILES refers to Simplified Molecular Input Line Entry System [80].

²² The red teamer attempted to purchase one of the proposed chemicals from a supplier, but was required to provide their university / lab address instead of a residential address. The red teamer then received the compound at their home address, but it is unclear whether this was because the supplier knew of the red teamer's status as a university-affiliated researcher, due to a package processing error, or some other reason. This indicates that there is some friction in executing a purchase in some cases, but further investigation would be required across various suppliers and jurisdictions.

Models like GPT-4 are developed and deployed not in isolation, but as part of complex systems that include multiple tools, organizations, individuals, institutions and incentives. This is one reason that powerful AI systems should be evaluated and adversarially tested in context for the emergence of potentially harmful system–system, or human–system feedback loops and developed with a margin
