ReasoningAgent benchmarking with SimpleBench #293

Hk669 · 2024-12-26T14:52:01Z

Why are these changes needed?

This is a draft PR for running SimpleBench with ReasoningAgent; it is not meant to be merged.
Source: https://simple-bench.com/

On the sample data (10 prompts), ReasoningAgent with gpt-4o-mini scores 20%.
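For readers unfamiliar with how a figure like this is derived, here is a minimal sketch of the scoring arithmetic. It assumes each prompt has a single-letter multiple-choice answer and that the agent's reply has already been reduced to one letter. The function name `accuracy` and the sample lists are hypothetical illustrations, not code or data from this PR.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that match the reference answers.

    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one prompt is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical data: 10 prompts, of which only the first two are answered correctly.
gold = ["A", "B", "C", "D", "E", "F", "A", "B", "C", "D"]
pred = ["A", "b", "A", "A", "A", "A", "B", "C", "D", "E"]

print(f"{accuracy(pred, gold):.0%}")  # prints "20%"
```

With only 10 prompts, each answer moves the score by 10 percentage points, so a single run on the sample data is a coarse signal.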

Related issue number

Checks

I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/docs/contributor-guide/documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

sonichi · 2024-12-26T17:12:58Z

Thanks. How about adding the test into the contrib-openai CI?

benchmark/run_simple-bench.py

BabyCNM

LGTM. There are some small issues (a filename mismatch) that would prevent the code from running.

Hk669 · 2025-01-01T10:58:28Z

> Thanks. How about adding the test into the contrib-openai CI?

Can you please clarify whether this is for the ReasoningAgent or for the benchmark?
FYI: the CI tests for the ReasoningAgent are in progress in PR #294.

sonichi · 2025-01-01T20:36:59Z

> Thanks. How about adding the test into the contrib-openai CI?

> Can you please clarify whether this is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in PR #294.

I mean we can add a SimpleBench performance check as an optional CI job for ReasoningAgent. It would be triggered only when necessary and would require approval.
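One way such an optional performance check could be expressed is as a small gate that a CI job runs after the benchmark finishes. The sketch below is an illustration under stated assumptions, not the check added in this PR: the function name `gate_exit_code` and the threshold value are hypothetical, and wiring it into a workflow (manual trigger, approval) is left to the CI configuration.

```python
def gate_exit_code(accuracy: float, threshold: float) -> int:
    """Map a benchmark accuracy to a process exit code for CI.

    Returns 0 (pass) when accuracy meets the threshold, 1 (fail) otherwise.
    Both values are fractions in the range [0, 1].
    """
    for name, value in (("accuracy", accuracy), ("threshold", threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return 0 if accuracy >= threshold else 1


# Hypothetical values: a 20% score checked against a 15% minimum.
status = gate_exit_code(accuracy=0.20, threshold=0.15)
print("pass" if status == 0 else "fail")
```

A CI step could pass the returned value to `sys.exit` so that a regression below the chosen threshold fails the job.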

Hk669 · 2025-01-02T17:26:45Z

> Thanks. How about adding the test into the contrib-openai CI?

Added. Please let me know if I missed anything. Thanks!
cc @sonichi

sonichi · 2025-01-02T17:59:44Z

> Thanks. How about adding the test into the contrib-openai CI?

> added. please let me know if i missed anything. thanks! cc @sonichi

It's better than before. An even better approach is to make a separate workflow so that it's not bundled with other contrib-openai tests.
@marklysze @BabyCNM @qingyun-wu what do you think is a good balance between convenience and cost control?

benchmark/simple_bench/run_simple_bench.py

davorrunje · 2025-02-12T20:30:24Z

@Hk669 What is the status with this PR?

Hk669 · 2025-02-13T00:31:38Z

This is just an experimental PR for anyone who wants to run SimpleBench on any agent.

It should be a good starting point for the benchmark.

CLAassistant · 2025-02-26T20:13:01Z

All committers have signed the CLA.

ReasoningAgent benchmarking with SimpleBench

ff4d629

Hk669 requested review from sonichi, qingyun-wu and BabyCNM December 26, 2024 14:52

Add o1-mini's detailed performance

02d5cbe

BabyCNM reviewed Dec 31, 2024

benchmark/run_simple-bench.py (outdated, resolved)

BabyCNM reviewed Dec 31, 2024

benchmark/run_simple-bench.py (outdated, resolved)

BabyCNM reviewed Dec 31, 2024

benchmark/run_simple-bench.py (outdated, resolved)

BabyCNM requested changes Dec 31, 2024

Hk669 added 2 commits January 1, 2025 10:33

move the files into simple-bench

b2e1b1e

required changes, cache, summary and method

716eba7

Hk669 marked this pull request as ready for review January 1, 2025 10:55

Hk669 requested a review from BabyCNM January 1, 2025 10:58

add the method property for the reasoningagent

d01d841

Hk669 had a problem deploying to openai1 January 1, 2025 11:21 — with GitHub Actions Failure

Add results for Anthropic, Gemini, DeepSeek V3

7232dbb

Signed-off-by: Mark Sze <[email protected]>

marklysze had a problem deploying to openai1 January 1, 2025 23:25 — with GitHub Actions Failure

Hk669 had a problem deploying to openai1 January 2, 2025 17:21 — with GitHub Actions Failure

Hk669 mentioned this pull request Jan 2, 2025

[Roadmap]: Reasoning Agent after V0.6 #246

Open

17 tasks

marklysze added 2 commits January 2, 2025 21:25

Added ConversableAgent runs

266bc9e

Signed-off-by: Mark Sze <[email protected]>

Merge branch 'simple-bench' of https://github.com/ag2ai/ag2 into simp…

319473a

…le-bench

marklysze had a problem deploying to openai1 January 2, 2025 21:26 — with GitHub Actions Failure

o1-preview SimpleBench test result

9ff315e

Signed-off-by: Mark Sze <[email protected]>

marklysze had a problem deploying to openai1 January 3, 2025 04:50 — with GitHub Actions Failure

BabyCNM reviewed Jan 3, 2025

benchmark/simple_bench/run_simple_bench.py (resolved)

davorrunje self-requested a review January 10, 2025 15:04
