Figure 2: Multi-panel 昀椀gure interpretation red teamer example New evaluations of scienti昀椀c capabilities have recently been published [57, 58], which will help anticipate the scienti昀椀c capabilities of these models and their impacts in turn. 5.4 Underrepresented Languages GPT-4o shows improved reading comprehension and reasoning across a sample of historically underrepresented languages, and narrows the gap in performance between these languages and English. To evaluate GPT-4o’s performance in text across a select group of languages historically underrep- resented in Internet text, we collaborated with external researchers 12 and language facilitators to develop evaluations in 昀椀ve African languages: Amharic, Hausa, Northern Sotho (Sepedi), Swahili, Yoruba. This initial assessment focused on translating two popular language benchmarks and creating small novel language-speci昀椀c reading comprehension evaluation for Amharic, Hausa and Yoruba. • ARC-Easy: This subset of the AI2 Reasoning Challenge [59] benchmark focuses on evaluating a model’s ability to answer common sense grade-school science questions; this subset contains questions that are generally easier to answer and do not require complex reasoning. • TruthfulQA[60]: This benchmark consists of questions that some humans might answer falsely due to misconceptions. The objective is to see if models can avoid generating false answers that mimic these misconceptions. 12Our principal research collaborators were Dr. David Adelani, Jonas Kgomo, Ed Bayes. 23

GPT-4o - Page 23 GPT-4o Page 22 Page 24