llm-benchmarking

Here are 69 public repositories matching this topic...

SalesforceAIResearch / enterprise-deep-research

Salesforce Enterprise Deep Research

react multi-agent-systems tailwindcss fastapi e2b langchain llm-benchmarking tavily deep-research-agent

Updated Nov 19, 2025
Python

lechmazur / confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

benchmark leaderboard gemini llama language-model claude rag o1 hallucinations ai-evaluation llm gemini-pro llm-benchmarking confabulations deepseek-r1 o3-mini

Updated Aug 7, 2025
HTML

HiThink-Research / BizFinBench

A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

finance benchmark llm llm-evaluation llm-benchmarking

Updated Nov 25, 2025
Python

tongye98 / Awesome-Code-Benchmark

A comprehensive review of code-domain benchmarks in LLM research.

data-science awesome benchmarks code-generation code-completion bug-fixing reasoning multimodal codellm code-efficiency codellms llm-benchmarking

Updated Sep 22, 2025

alopatenko / LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

evaluation llm llm-evaluation llm-benchmarking generative-ai-benchmarking

Updated Dec 26, 2025
HTML

robertvacareanu / llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates.

linear-regression sklearn regression regression-models large-language-models llm llms llm-inference llm-benchmarking

Updated Oct 12, 2025
Python

lakeraai / pint-benchmark

A benchmark for prompt injection detection systems.

benchmark llm prompt-injection llm-security llm-benchmarking

Updated Dec 16, 2025
Jupyter Notebook

oripress / AlgoTune

AlgoTune is a NeurIPS 2025 benchmark made up of 154 math, physics, and computer science problems. The goal is to write code that solves each problem and is faster than existing implementations.

code-generation code-optimization llm-agent llm-coder llm-benchmarking code-agent

Updated Dec 26, 2025
Python

asimsinan / LLM-Research

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks

arxiv-papers large-language-models llm llms llm-datasets llm-tools buyuk-dil-modelleri llm-research llm-theses llm-benchmarking llm-frameworks

Updated Oct 8, 2024
Python

damianomarsili / VADAR

Program synthesis for 3D spatial reasoning

program-synthesis 3d spatial-reasoning llms llm-benchmarking

Updated Jun 16, 2025
Jupyter Notebook

AKSW / LLM-KG-Bench

LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.

sparql rdf knowledge-graph large-language-models llm llm-benchmarking

Updated Dec 22, 2025
Python

forecastingresearch / forecastbench

A dynamic forecasting benchmark for LLMs

forecasting llm-benchmarking

Updated Dec 25, 2025
HTML

MJ-Bench / MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

reward-models multimodal-foundation-model llm-benchmarking llm-as-a-judge multimodal-judge

Updated Jun 3, 2025
Jupyter Notebook

roboflow / vision-ai-checkup

Take your LLM to the optometrist.

vlm vision-language llm vision-language-model llm-benchmarking

Updated Dec 12, 2025
Python

HiThink-Research / MME-Finance

[MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

finance multimodal llm llm-evaluation llm-benchmarking mmllm

Updated Oct 14, 2025
Python

nl4opt / ORQA

[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.

optimization linear-programming operations-research mathematical-modelling mixed-integer-programming multi-choice llm llm-reasoning llm-benchmarking llm4math aaai2025 ai4or llm4or llm4opt

Updated Jun 7, 2025
Python

Belluxx / LocalAIME

Test your local LLMs on the AIME problems

benchmark-datasets local-llm llm-benchmarking

Updated Jun 7, 2025
Python

lechmazur / deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

nlp machine-learning gemini llama language-model model-evaluation ai-safety mistral claude disinformation ai-security ai-benchmarks ai-evaluation llm llm-benchmarking gpt4o

Updated Mar 20, 2025

AUCOHL / RTL-Repo

RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24

verilog rtl-design llm llm-benchmarking

Updated Jun 5, 2024
Python

truefoundry / llm-locust

LLM Locust combines the simplicity of Locust with deep support for LLM-specific benchmarking

llm-benchmarking llm-performance

Updated Dec 2, 2025
TypeScript
