Open source evaluation orchestration platform for Large Language Models.
EvalHub provides a unified way to evaluate LLMs across multiple frameworks — submit evaluations via CLI, Python SDK, or REST API and let EvalHub handle orchestration, tracking, and artifact storage. It runs locally for development and scales on Kubernetes for production.
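As a rough sketch of the Python SDK path, the snippet below submits a single evaluation. The import path, client class, and method names (`EvalHubClient`, `evaluations.submit`, `wait`) are illustrative assumptions, not the documented eval-hub-sdk interface; the CLI quickstart further down shows the supported flags.

```python
# Hypothetical SDK usage; EvalHubClient, evaluations.submit, and wait are
# assumed names for illustration, not the documented eval-hub-sdk API.
from eval_hub_sdk import EvalHubClient  # assumed import path

client = EvalHubClient(base_url="http://localhost:8080")  # assumed EvalHub server URL

run = client.evaluations.submit(
    name="my-first-eval",
    model_url="http://localhost:11434/v1",  # OpenAI-compatible endpoint serving the model
    model_name="qwen2.5:1.5b",
    provider="lm_evaluation_harness",
    benchmark="mmlu",
)
result = run.wait()    # block until the evaluation finishes
print(result.metrics)  # metrics are also persisted to MLflow
```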
Multi-framework
Evaluate with LightEval, GuideLLM, lm-eval-harness, Garak, MTEB, IBM CLEAR, or bring your own framework.
Kubernetes-native
Each evaluation runs as an isolated Kubernetes Job with automatic lifecycle management.
Benchmark Collections
Group benchmarks with weighted scoring for domain-specific evaluations in a single API call (see the scoring sketch after this feature list).
MLflow Integration
Track experiments, compare runs, and persist metrics to MLflow automatically (see the retrieval sketch below).
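For the Benchmark Collections feature above, here is a minimal sketch of how a collection's weighted score could be aggregated from per-benchmark results. The benchmark names, weights, and scores are made-up examples; the actual collection schema and aggregation are defined by EvalHub, not by this snippet.

```python
# Illustrative weighted scoring for a benchmark collection.
# Benchmark names, weights, and scores below are hypothetical examples.
collection = {
    "mmlu": {"weight": 0.5, "score": 0.71},
    "gsm8k": {"weight": 0.3, "score": 0.64},
    "hellaswag": {"weight": 0.2, "score": 0.82},
}

total_weight = sum(b["weight"] for b in collection.values())
collection_score = sum(b["weight"] * b["score"] for b in collection.values()) / total_weight
print(f"collection score: {collection_score:.3f}")  # 0.711 with the numbers above
```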
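Because metrics are persisted to MLflow (the MLflow Integration feature above), completed evaluations can also be inspected with the standard MLflow client. A sketch, assuming each evaluation is logged as an MLflow run under an experiment named after the evaluation; the tracking URI and experiment naming are assumptions, not guarantees from EvalHub.

```python
# Read evaluation metrics back from MLflow. mlflow.search_runs is the standard
# MLflow API; the tracking URI and experiment name here are assumed values.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
runs = mlflow.search_runs(experiment_names=["my-first-eval"])
print(runs[[c for c in runs.columns if c.startswith("metrics.")]])
```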
EvalHub integrates evaluation frameworks through adapters:
| Adapter | Provider | Support Tier | What it measures |
|---|---|---|---|
| lm-eval-harness | lm_evaluation_harness | Core | 167 benchmarks across 12 categories (MMLU, HellaSwag, GSM8K, …) |
| LightEval | lighteval | Core | Language model benchmarks: accuracy, exact match |
| Garak | garak | Core | Safety and vulnerability scanning (OWASP Top 10) |
| GuideLLM | guidellm | Validated | Inference performance: TTFT, ITL, throughput, latency |
| IBM CLEAR | ibm-clear | Validated | Agentic trace evaluation: LLM-as-judge failure pattern analysis on JSON traces |
| MTEB | mteb | Community | Embedding evaluation: STS, retrieval, classification |
```bash
pip install "eval-hub-sdk[cli]"
```

```bash
evalhub eval run \
  --name my-first-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --wait
```

Apache 2.0. See LICENSE.