Agentic Evaluations At Scale For - Detailed Analysis & Overview

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts ...

Agentic Evals by Shishir Patil

Shishir Patil, a Research Scientist at Meta, delivered a presentation on AI agents and their ...

Agentic Evaluations Workshop - Deep Dive on the Future on Evals for Agents.

As agents evolve from text conversations to autonomous agents capable of multi-step reasoning, tool use, and real-world task ...

LLM as a Judge: Scaling AI Evaluation Strategies

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...

Agent Evaluation & Benchmarks - Agentic AI MOOC 2025 Lecture 4 Summary

This lecture discusses the critical shift from evaluating static LLMs to complex AI agents that take action. It explores the vital role of ...

Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation

For more information about Stanford's graduate programs, visit: https://online.stanford.edu/graduate-education November 21, ...

How AI Engineers Improve Agentic Products

Anyone can be a math and science person with Brilliant! Visit https://brilliant.org/AdamLucek/ to start learning and save 20% off an ...

Evaluations in Agentic Workflows - n8n Builders Berlin (Live Demo)

Recorded at the Advanced Track of n8n Builders Berlin, this talk features JP van Oosten, who leads the AI team at n8n, explaining ...

The agent evaluation revolution

This video introduces a new series on testing AI agents, focusing on why traditional ...

Chain of Thought | Intro to Scale's Agentic Leaderboards

In this episode of Chain of Thought, @Scale_AI's Brad Kenstler (Head of Agent Capabilities and Environments) sits down with ...

Ensure AI Agents Work: Evaluation Frameworks for Scaling Success — Aparna Dhinakaran, CEO Arize

Turning AI agents into reliable, production-ready tools that deliver tangible business results requires more than just great models.

How to set Evaluation for AI Agents & Scale them

Join Mahesh Yadav, top Maven instructor and former AI PM leader at Google, Meta, and Microsoft. In this session, Mahesh breaks ...

AI and Agent Observability in Azure AI Foundry and Azure Monitor | BRK168

Learn how