# llms-benchmarking

Here are 56 public repositories matching this topic...

JonathanChavezTamales / LLMStats

A comprehensive set of LLM benchmark scores and provider prices.

llm llmops llm-evaluation llm-agents llms-benchmarking

Updated Mar 4, 2025
JavaScript

ChemFoundationModels / ChemLLMBench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

nlp benchmark chemistry ai4science large-language-models llm llms-benchmarking

Updated Jul 26, 2024
Jupyter Notebook

steel-dev / awesome-web-agents

🔥 A list of tools, frameworks, and resources for building AI web agents

ai browser-automation ai-agents llms llms-benchmarking

Updated Mar 8, 2025

lerogo / MMGenBench

Official repository of MMGenBench

mllm llms-benchmarking mmgenbench

Updated Mar 8, 2025
Python

bboylyg / BackdoorLLM

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

backdoor llms llms-benchmarking

Updated Feb 21, 2025
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

lamalab-org / chembench

How good are LLMs at chemistry?

benchmark machine-learning chemistry safety materials-science llm llms llms-benchmarking

Updated Mar 12, 2025
Python

FSoft-AI4Code / XMainframe

Language Model for Mainframe Modernization

migration cobol mainframe code-summarization codellm llms-benchmarking

Updated Aug 23, 2024
Python

lechmazur / nyt-connections

Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words

testing benchmark evaluation puzzles reasoning llm llms-benchmarking gpt-4o sonnet3-7 gpt-4-5

Updated Mar 12, 2025
Python

lechmazur / generalization

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

benchmark evaluation generalization llm llms llms-benchmarking sonnet3-7 gpt-4-5

Updated Mar 8, 2025

RaptorMai / CompBench

CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.

benchmark reasoning vision-and-language multimodal-deep-learning human-annotation foundation-models large-language-models llms vision-language-model multimodal-large-language-models evaluation-llms llms-benchmarking

Updated Aug 6, 2024
Jupyter Notebook

amazon-science / llm-code-preference

Training and Benchmarking LLMs for Code Preference.

code-generation llm-training llm-evaluation llms-benchmarking

Updated Nov 15, 2024
Python

epfl-dlab / cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

ai competitive-programming agents competitive-programming-contests competitive-coding llms llms-reasoning llms-benchmarking aiflows

Updated Feb 12, 2024
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Mar 12, 2025
Svelte

declare-lab / resta

Restore safety in fine-tuned language models through task arithmetic

alignment safety alignment-algorithm llm llms llm-safety llms-benchmarking llm-safety-benchmark

Updated Mar 28, 2024
Python

rajpurkarlab / craft-md

conversational-ai llms-benchmarking clinical-llm multiturn-conversations

Updated Mar 4, 2025
Python

Laoyu84 / 4onebench

A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.

agents large-language-models llms-benchmarking

Updated Nov 28, 2024
Python

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

Paulescu / text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

machine-learning embeddings llms llms-benchmarking

Updated Apr 19, 2024
Python

logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

leaderboard llm chain-of-thought gen-ai llms-reasoning llms-benchmarking

Updated Feb 6, 2025
Jupyter Notebook
