EvalHub

Open source evaluation orchestration platform for Large Language Models.

EvalHub provides a unified way to evaluate LLMs across multiple frameworks — submit evaluations via CLI, Python SDK, or REST API and let EvalHub handle orchestration, tracking, and artifact storage. It runs locally for development and scales on Kubernetes for production.

Multi-framework

Evaluate with LightEval, GuideLLM, lm-eval-harness, Garak, MTEB, IBM CLEAR, or bring your own framework.

Kubernetes-native

Each evaluation runs as an isolated Kubernetes Job with automatic lifecycle management.

Benchmark Collections

Group benchmarks with weighted scoring for domain-specific evaluations in a single API call.
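
A collection request might look like the following sketch. The endpoint path, payload fields, and benchmark pairings are illustrative assumptions, not the documented API:

```python
import requests

# Hypothetical weighted collection submitted in a single call. The route
# "/v1/collections" and all field names below are assumptions.
collection = {
    "name": "reasoning-suite",
    "benchmarks": [
        {"provider": "lm_evaluation_harness", "benchmark": "mmlu", "weight": 0.5},
        {"provider": "lm_evaluation_harness", "benchmark": "gsm8k", "weight": 0.3},
        {"provider": "lighteval", "benchmark": "hellaswag", "weight": 0.2},
    ],
}

resp = requests.post("http://localhost:8080/v1/collections", json=collection)
resp.raise_for_status()
print(resp.json())
```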

MLflow Integration

Track experiments, compare runs, and persist metrics to MLflow automatically.
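
Because metrics land in MLflow, completed runs can be queried with the standard MLflow client. A minimal sketch, assuming EvalHub logs runs under an experiment named "evalhub" (the experiment name and metric key are assumptions):

```python
import mlflow

# Point at the tracking server EvalHub writes to (URI is illustrative).
mlflow.set_tracking_uri("http://localhost:5000")

# search_runs returns a pandas DataFrame; "evalhub" and the
# "metrics.accuracy" column are assumptions about what EvalHub logs.
runs = mlflow.search_runs(experiment_names=["evalhub"])
print(runs[["run_id", "metrics.accuracy"]])
```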

  • Versioned REST API (v1) with OpenAPI specification
  • Provider registry with benchmark discovery (see the sketch after this list)
  • OCI artifact persistence for evaluation results
  • Prometheus metrics and OpenTelemetry tracing
  • Multi-tenancy with namespace-based isolation and Kubernetes RBAC
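
For example, provider discovery can be scripted against the v1 API. A minimal sketch, assuming a local server on port 8080 and a `/v1/providers` route; the route and response shape are assumptions, and the OpenAPI specification is the source of truth:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed local server address

# List registered providers and the benchmarks each one exposes. Field
# names ("name", "benchmarks") are assumptions about the response shape.
providers = requests.get(f"{BASE}/providers", timeout=10).json()
for provider in providers:
    print(provider["name"], "->", provider.get("benchmarks", []))
```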

EvalHub architecture overview

EvalHub consists of three components:

  • Server — Go REST API that orchestrates evaluation jobs, manages providers, and stores results (SQLite or PostgreSQL)
  • SDK — Python client library, CLI, and adapter framework for building integrations (see the adapter sketch below)
  • Contrib — Community-contributed framework adapters packaged as container images
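
For the adapter framework mentioned above, a bring-your-own-framework adapter might look like this sketch. The interface, method names, and result shape are assumptions for illustration, not the SDK's actual classes:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Uniform result shape (assumed, for illustration)."""
    benchmark: str
    metrics: dict

class MyFrameworkAdapter:
    """Wraps a custom framework so EvalHub can schedule and track it."""

    provider_name = "my_framework"  # how the provider registry would see it

    def list_benchmarks(self) -> list[str]:
        # Advertise what this adapter can run (used for benchmark discovery).
        return ["toy-benchmark"]

    def run(self, model_url: str, model_name: str, benchmark: str) -> EvalResult:
        # Call the wrapped framework here and map its output to metrics.
        score = 0.75  # placeholder result
        return EvalResult(benchmark=benchmark, metrics={"accuracy": score})
```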

| Adapter | Provider | Support Tier | What it measures |
| --- | --- | --- | --- |
| lm-eval-harness | lm_evaluation_harness | Core | 167 benchmarks across 12 categories (MMLU, HellaSwag, GSM8K, …) |
| LightEval | lighteval | Core | Language model benchmarks: accuracy, exact match |
| Garak | garak | Core | Safety and vulnerability scanning (OWASP Top 10) |
| GuideLLM | guidellm | Validated | Inference performance: time-to-first-token (TTFT), inter-token latency (ITL), throughput, latency |
| IBM CLEAR | ibm-clear | Validated | Agentic trace evaluation: LLM-as-judge failure pattern analysis on JSON traces |
| MTEB | mteb | Community | Embedding evaluation: STS, retrieval, classification |

Install the SDK and run a first evaluation:

```sh
pip install "eval-hub-sdk[cli]"

evalhub eval run \
  --name my-first-eval \
  --model-url http://localhost:11434/v1 \
  --model-name qwen2.5:1.5b \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --wait
```
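
The same evaluation can be submitted from Python. A sketch assuming the SDK mirrors the CLI flags; the import path, client class, and method names are assumptions:

```python
from eval_hub_sdk import Client  # assumed import path

client = Client("http://localhost:8080")  # assumed server address

# Mirrors the CLI flags above; wait=True blocks until the job finishes.
job = client.evaluations.run(
    name="my-first-eval",
    model_url="http://localhost:11434/v1",
    model_name="qwen2.5:1.5b",
    provider="lm_evaluation_harness",
    benchmark="mmlu",
    wait=True,
)
print(job.status, job.results)
```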

Apache 2.0 — see LICENSE.