Bbeh #1925
Conversation
Do we need this file? Your dataset config file already includes those subset names.
Not needed here, sorry for the inconvenience.
LGTM
Below is a short Markdown description of this contribution to BBEH, which adds a BBEH evaluation function in support of the OpenCompass evaluation project:
BIG-Bench Extra Hard (BBEH) - OpenCompass Evaluation Support
Overview
This update enhances the BIG-Bench Extra Hard (BBEH) benchmark by integrating support for the OpenCompass evaluation project. I have contributed a dedicated BBEH evaluation function to streamline the assessment of large language models (LLMs) with OpenCompass, a widely used framework for evaluating LLMs and their reasoning capabilities.
Contribution
Motivation: The goal is to make BBEH more accessible to researchers and developers by enabling seamless integration with OpenCompass, thereby broadening the evaluation ecosystem for advanced LLM reasoning tasks.
Modification: Added a new evaluate_bbeh_opencompass.py script in the bbeh/evaluation/ directory. This script implements a BBEH-specific evaluation function compatible with OpenCompass, allowing users to run BBEH tasks and aggregate results within the OpenCompass framework (a minimal sketch of such an evaluator appears after this list).
Use Case: Researchers can now use OpenCompass to evaluate LLMs on BBEH’s challenging reasoning tasks (e.g., Causal Understanding, Spatial Reasoning) with minimal setup, leveraging OpenCompass’s visualization and comparison tools.
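For illustration, here is a minimal sketch of what such a BBEH evaluator can look like in the OpenCompass style, assuming the usual BaseEvaluator interface (a `score(predictions, references)` method returning a dict with a `score` key). This is not the exact contents of evaluate_bbeh_opencompass.py; the class name and the answer-extraction heuristic are illustrative.

```python
# Illustrative sketch of a BBEH evaluator in the OpenCompass style.
# Assumes the BaseEvaluator interface; names and heuristics are hypothetical.
import re

from opencompass.openicl.icl_evaluator import BaseEvaluator


class BBEHEvaluator(BaseEvaluator):
    """Exact-match scoring for BBEH tasks (illustrative, not the PR's exact code)."""

    def _extract_answer(self, text: str) -> str:
        # Heuristic: take the span after "the answer is", if present.
        match = re.search(r'the answer is\s*(.*)', text, flags=re.IGNORECASE)
        answer = match.group(1) if match else text
        # Normalize: strip surrounding quotes, trailing period, and whitespace.
        return answer.strip().strip('.').strip('"').strip().lower()

    def score(self, predictions, references):
        if len(predictions) != len(references):
            return {'error': 'predictions and references have different lengths'}
        correct = sum(
            self._extract_answer(pred) == self._extract_answer(ref)
            for pred, ref in zip(predictions, references)
        )
        return {'score': 100 * correct / len(references)}
```

Registering an evaluator like this in the BBEH dataset config lets OpenCompass compute and summarize per-task scores such as those listed under Initial results below.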
Initial results
| dataset | version | metric | mode | Meta-Llama-3-8B-Instruct-LMDeploy-API |
| --- | --- | --- | --- | --- |
| bbeh_boolean_expressions | d7a200 | score | gen | 14.00 |
| bbeh_disambiguation_qa | d7a200 | score | gen | 33.33 |
| bbeh_geometric_shapes | d7a200 | score | gen | 13.50 |
| bbeh_hyperbaton | d7a200 | score | gen | 1.00 |
| bbeh_movie_recommendation | d7a200 | score | gen | 28.00 |
| bbeh_nycc | d7a200 | score | gen | 11.00 |
| bbeh_shuffled_objects | d7a200 | score | gen | 10.00 |
| bbeh_boardgame_qa | d7a200 | score | gen | 18.50 |
| bbeh_buggy_tables | d7a200 | score | gen | 0.00 |
| bbeh_causal_understanding | d7a200 | score | gen | 42.50 |
| bbeh_dyck_languages | d7a200 | score | gen | 3.50 |
| bbeh_linguini | d7a200 | score | gen | 2.00 |
| bbeh_multistep_arithmetic | d7a200 | score | gen | 0.00 |
| bbeh_object_counting | d7a200 | score | gen | 0.00 |
| bbeh_object_properties | d7a200 | score | gen | 1.00 |
| bbeh_sarc_triples | d7a200 | score | gen | 17.00 |
| bbeh_spatial_reasoning | d7a200 | score | gen | 4.00 |
| bbeh_sportqa | d7a200 | score | gen | 5.00 |
| bbeh_temporal_sequence | d7a200 | score | gen | 2.00 |
| bbeh_time_arithmetic | d7a200 | score | gen | 3.00 |
| bbeh_web_of_lies | d7a200 | score | gen | 7.50 |
| bbeh_word_sorting | d7a200 | score | gen | 2.00 |
| bbeh_zebra_puzzles | d7a200 | score | gen | 3.50 |
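For a quick overall figure, the per-task scores in a summary like the one above can be averaged. The snippet below is an illustration only and is not part of the PR; the CSV file name is hypothetical, and the column name matches the model column in the table above.

```python
# Illustrative aggregation of an OpenCompass summary CSV.
# The file name 'summary_bbeh.csv' is hypothetical.
import csv
from statistics import mean

with open('summary_bbeh.csv', newline='') as f:
    rows = list(csv.DictReader(f))

# The last column holds the model's score for each BBEH subtask.
model_col = 'Meta-Llama-3-8B-Instruct-LMDeploy-API'
scores = [float(row[model_col]) for row in rows if row['dataset'].startswith('bbeh_')]
print(f'Macro average over {len(scores)} BBEH subtasks: {mean(scores):.2f}')
```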
Acknowledgments
This contribution builds on the existing BBEH framework and aligns with the OpenCompass project’s mission to advance LLM evaluation. Feedback and suggestions are welcome!