SGI-Bench: Testing LLMs as Scientists - Detailed Analysis & Overview

SGI-Bench: Testing LLMs as Scientists
Can AI Read Scientific Figures? We Put LLMs to the Ultimate Test
What are Large Language Model (LLM) Benchmarks?
AutoResearchBench: Testing LLMs on Research Papers
Testing LLM Coding Agents on Physics Code
LLM UNDERSTANDING: 30. Jackie CHEUNG "How Do We Know What LLMs Can Do? Benchmarking and Evaluation"
Benchmarking LLMs at the Game Of Science (Eleusis)
DiscoverPhysics: New LLM Scientific Benchmark
LLM-assisted Scientific Experimentation: An Overview
Testing LLM Cooperation in Multi-Agent Simulation by Zhijing Jin
SoundnessBench: Can LLMs Spot Flawed Research?
LLM Analogical Reasoning for Scientific Discovery

SGI-Bench: Testing LLMs as Scientists

In this AI Research Roundup episode, Alex discusses the paper: 'Probing

Can AI Read Scientific Figures? We Put LLMs to the Ultimate Test

Can large language models really extract quantitative data from

What are Large Language Model (LLM) Benchmarks?

Want to play with the technology yourself? Explore our interactive demo → https://ibm.biz/BdKetJ Learn more about the ...

AutoResearchBench: Testing LLMs on Research Papers

In this AI Research Roundup episode, Alex discusses the paper: 'AutoResearchBench: Benchmarking AI Agents on Complex ...

Testing LLM Coding Agents on Physics Code

In this AI Research Roundup episode, Alex discusses the paper: 'Physics Is All You Need? A Case Study in Physicist-Supervised ...

LLM UNDERSTANDING: 30. Jackie CHEUNG "How Do We Know What LLMs Can Do? Benchmarking and Evaluation"

Benchmarking LLMs at the Game Of Science (Eleusis)

A card game ♠️♥️ to benchmark AI models at

DiscoverPhysics: New LLM Scientific Benchmark

In this AI Research Roundup episode, Alex discusses the paper: 'DiscoverPhysics: Benchmarking

LLM-assisted Scientific Experimentation: An Overview

by Jennifer D'Souza at the AutoML School 2025.

Testing LLM Cooperation in Multi-Agent Simulation by Zhijing Jin

This short talk was delivered at the 2025 Cooperative AI Summer Retreat. Zhijing Jin (she/her) is an incoming Assistant Professor ...

SoundnessBench: Can LLMs Spot Flawed Research?

In this AI Research Roundup episode, Alex discusses the paper: 'SoundnessBench: Can Your AI

LLM Analogical Reasoning for Scientific Discovery

In this AI Research Roundup episode, Alex discusses the paper: 'Unlocking

Adapting While Learning: Grounding LLMs for Scientific Problems I-Tool Usage Adaptation | #ai #2024

Paper: https://arxiv.org/abs/2411.00412 This research introduces a novel two-stage training method to improve Large Language ...

A Design Science for LLM Agent Evaluation

In this AI Research Roundup episode, Alex discusses the paper: 'Interactive Evaluation Requires a Design