Open source evaluation orchestration platform for Large Language Models.
EvalHub provides a unified way to evaluate LLMs across multiple frameworks — submit evaluations via CLI, Python SDK, or REST API and let EvalHub handle orchestration, tracking, and artifact storage. It runs locally for development and scales on Kubernetes for production.
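As a rough sketch of the Python SDK path, the snippet below submits a single evaluation. The import path, client class, and method names (`EvalHubClient`, `evaluations.submit`, `wait`) are illustrative assumptions, not the documented eval-hub-sdk interface; the CLI quickstart further down shows the supported flags.

```python
# Hypothetical SDK usage; EvalHubClient, evaluations.submit, and wait are
# assumed names for illustration, not the documented eval-hub-sdk API.
from eval_hub_sdk import EvalHubClient  # assumed import path

client = EvalHubClient(base_url="http://localhost:8080")  # assumed EvalHub server URL

run = client.evaluations.submit(
    name="my-first-eval",
    model_url="http://localhost:11434/v1",  # OpenAI-compatible endpoint serving the model
    model_name="qwen2.5:1.5b",
    provider="lm_evaluation_harness",
    benchmark="mmlu",
)
result = run.wait()    # block until the evaluation finishes
print(result.metrics)  # metrics are also persisted to MLflow
```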
Multi-framework
Evaluate with LightEval, GuideLLM, lm-eval-harness, Garak, MTEB, IBM CLEAR, or bring your own framework.
Kubernetes-native
Each evaluation runs as an isolated Kubernetes Job with automatic lifecycle management.
Benchmark Collections
Group benchmarks with weighted scoring for domain-specific evaluations in a single API call (see the scoring sketch after this feature list).
MLflow Integration
Track experiments, compare runs, and persist metrics to MLflow automatically (see the retrieval sketch below).
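For the Benchmark Collections feature above, here is a minimal sketch of how a collection's weighted score could be aggregated from per-benchmark results. The benchmark names, weights, and scores are made-up examples; the actual collection schema and aggregation are defined by EvalHub, not by this snippet.

```python
# Illustrative weighted scoring for a benchmark collection.
# Benchmark names, weights, and scores below are hypothetical examples.
collection = {
    "mmlu": {"weight": 0.5, "score": 0.71},
    "gsm8k": {"weight": 0.3, "score": 0.64},
    "hellaswag": {"weight": 0.2, "score": 0.82},
}

total_weight = sum(b["weight"] for b in collection.values())
collection_score = sum(b["weight"] * b["score"] for b in collection.values()) / total_weight
print(f"collection score: {collection_score:.3f}")  # 0.711 with the numbers above
```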
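Because metrics are persisted to MLflow (the MLflow Integration feature above), completed evaluations can also be inspected with the standard MLflow client. A sketch, assuming each evaluation is logged as an MLflow run under an experiment named after the evaluation; the tracking URI and experiment naming are assumptions, not guarantees from EvalHub.

```python
# Read evaluation metrics back from MLflow. mlflow.search_runs is the standard
# MLflow API; the tracking URI and experiment name here are assumed values.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
runs = mlflow.search_runs(experiment_names=["my-first-eval"])
print(runs[[c for c in runs.columns if c.startswith("metrics.")]])
```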
EvalHub integrates evaluation frameworks through adapters:
| Adapter | Provider | Support Tier | What it measures |
|---|---|---|---|
| lm-eval-harness | lm_evaluation_harness | Core | 167 benchmarks across 12 categories (MMLU, HellaSwag, GSM8K, …) |
| LightEval | lighteval | Core | Language model benchmarks: accuracy, exact match |
| Garak | garak | Core | Safety and vulnerability scanning (OWASP Top 10) |
| GuideLLM | guidellm | Validated | Inference performance: TTFT, ITL, throughput, latency |
| IBM CLEAR | ibm-clear | Validated | Agentic trace evaluation: LLM-as-judge failure pattern analysis on JSON traces |
| MTEB | mteb | Community | Embedding evaluation: STS, retrieval, classification |
```bash
pip install "eval-hub-sdk[cli]"
```

```bash
evalhub eval run \
  --name my-first-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --wait
```

Apache 2.0. See LICENSE.