
EvalHub

Open source evaluation orchestration platform for Large Language Models.

What is EvalHub?

EvalHub is an evaluation orchestration platform designed for systematic LLM evaluation. It supports both local development and Kubernetes-native deployment at scale. The platform consists of three components:

  • EvalHub Server: REST API orchestration service for managing evaluation workflows
  • EvalHub SDK: Python SDK for submitting evaluations and building adapters
  • EvalHub Contrib: Community-contributed framework adapters

Core Problem Solved

LLM evaluation involves coordinating multiple frameworks, managing evaluation jobs, tracking results, and handling diverse deployment environments. EvalHub solves this by providing:

  • Unified evaluation API: Submit evaluations across frameworks using a consistent interface (see the sketch after this list)
  • Kubernetes-native orchestration: Automatic job lifecycle management, scaling, and resource isolation
  • Framework adapter pattern: "Bring Your Own Framework" (BYOF) approach with standardised integration
  • Multi-environment support: Deploy locally for development or on OpenShift for production
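
For example, the same SDK call targets different frameworks simply by changing the provider and benchmark identifiers. A minimal sketch reusing the identifiers from the use cases further down this page; the Quick Start section covers the full workflow:

from evalhub.client import EvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkSpec

client = EvalHubClient(base_url="http://localhost:8080")
model = ModelConfig(url="http://vllm-server:8000", name="llama-3-8b")

# Accuracy benchmark through the lm_evaluation_harness provider...
accuracy_job = client.submit_evaluation(
    model=model,
    benchmarks=[BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness")],
)

# ...and a load test through the guidellm provider, using the same call.
performance_job = client.submit_evaluation(
    model=model,
    benchmarks=[BenchmarkSpec(benchmark_id="performance_test", provider_id="guidellm")],
)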

Architecture Overview

graph TB
    subgraph client["Client Applications"]
        PY[Python SDK Client]
        API[Direct REST API]
    end

    subgraph server["EvalHub Server"]
        REST[REST API Server]
        ORCH[Job Orchestrator]
        PROV[Provider Registry]
    end

    subgraph k8s["Kubernetes / OpenShift"]
        JOB[Evaluation Jobs]
        ADAPT[Framework Adapters]
        SIDE[Sidecar Containers]
    end

    subgraph storage["Storage"]
        OCI[OCI Registry]
        DB[(PostgreSQL)]
    end

    PY --> REST
    API --> REST
    REST --> ORCH
    REST --> PROV
    ORCH --> JOB
    JOB --> ADAPT
    JOB --> SIDE
    ADAPT --> SIDE
    SIDE --> OCI
    REST --> DB

    style server fill:#e3f2fd
    style k8s fill:#fff3e0
    style storage fill:#f3e5f5

Component Responsibilities

| Component | Description | Technology |
|-----------|-------------|------------|
| Server | REST API, job orchestration, provider management | Go; SQLite (local) / PostgreSQL (production) |
| SDK | Client library, adapter framework, data models | Python 3.12+ |
| Contrib | Community framework adapters (LightEval, GuideLLM) | Python containers |
| Jobs | Isolated evaluation execution environments | Kubernetes Jobs |
| Registry | Immutable artifact storage | OCI registries |

Key Features

Server Features

  • REST API: Versioned API (v1) with OpenAPI specification
  • Provider management: Discover evaluation providers and benchmarks (see the REST sketch after this list)
  • Collection management: Curated benchmark collections with weighted scoring
  • Job orchestration: Kubernetes Job lifecycle management
  • Persistent storage: SQLite for local development, PostgreSQL for production
  • Prometheus metrics: Production-ready observability
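
Because the API is versioned and described by an OpenAPI specification, providers can also be listed without the SDK. A minimal sketch using the requests library; the /v1/providers path is an assumption, so consult the OpenAPI document served by your deployment for the authoritative routes:

import requests

BASE_URL = "http://localhost:8080"

# NOTE: the /v1/providers path is an assumption based on the versioned (v1) API;
# check the server's OpenAPI specification for the exact route.
response = requests.get(f"{BASE_URL}/v1/providers", timeout=10)
response.raise_for_status()
print(response.json())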

SDK Features

Client SDK (evalhub.client):

  • Submit evaluations to EvalHub service
  • Discover providers, benchmarks, and collections
  • Monitor job status and retrieve results
  • Async/await support for non-blocking workflows (see the sketch below)
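
The example below shows one way to keep submissions non-blocking by running the synchronous client calls off the event loop with asyncio.to_thread. It is a generic sketch, not the SDK's native async API, and the model endpoints are illustrative:

import asyncio

from evalhub.client import EvalHubClient
from evalhub.models.api import BenchmarkSpec, ModelConfig

async def evaluate(name: str, url: str) -> None:
    # Run the synchronous client call in a worker thread so several
    # submissions can proceed concurrently without blocking the event loop.
    client = EvalHubClient(base_url="http://localhost:8080")
    job = await asyncio.to_thread(
        client.submit_evaluation,
        model=ModelConfig(url=url, name=name),
        benchmarks=[BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness")],
    )
    print(f"{name}: submitted job {job.job_id}")

async def main() -> None:
    # Hypothetical model endpoints, for illustration only.
    await asyncio.gather(
        evaluate("llama-3-8b", "http://vllm-a:8000"),
        evaluate("mistral-7b", "http://vllm-b:8000"),
    )

asyncio.run(main())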

Adapter SDK (evalhub.adapter):

  • Base class for framework integration
  • Callback interface for status reporting
  • OCI artifact persistence
  • Settings-based configuration

Core Models (evalhub.models):

  • Shared data structures
  • Request/response schemas
  • API model validation (see the sketch after this list)
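
These are the same models used throughout the examples on this page, and malformed requests are rejected before they reach the server. A sketch only: whether provider_id is strictly required and which exception type is raised depend on the models' validation backend, so both are assumptions here:

from evalhub.models.api import BenchmarkSpec, ModelConfig

# Well-formed request models, as passed to submit_evaluation().
model = ModelConfig(url="http://localhost:8000/v1", name="llama-3-8b")
spec = BenchmarkSpec(benchmark_id="gsm8k", provider_id="lm_evaluation_harness")

# Invalid input is rejected at construction time.
# NOTE: the exception type (and whether provider_id is required) is an
# assumption about the validation backend, hence the broad except.
try:
    BenchmarkSpec(benchmark_id="gsm8k")
except Exception as err:
    print(f"validation failed: {err}")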

Contrib Features

  • LightEval: Language model evaluation (HellaSwag, ARC, MMLU, TruthfulQA, GSM8K)
  • GuideLLM: Performance benchmarking, including time to first token (TTFT), inter-token latency (ITL), throughput, and request latency
  • Containerised: Production-ready container images

Quick Start

Submit an Evaluation (Python SDK)

from evalhub.client import EvalHubClient
from evalhub.models.api import ModelConfig, BenchmarkSpec

# Connect to EvalHub service
client = EvalHubClient(base_url="http://localhost:8080")

# Submit evaluation
job = client.submit_evaluation(
    model=ModelConfig(
        url="https://api.openai.com/v1",
        name="gpt-4"
    ),
    benchmarks=[
        BenchmarkSpec(
            benchmark_id="mmlu",
            provider_id="lm_evaluation_harness"
        )
    ]
)

# Monitor progress
status = client.get_job_status(job.job_id)
print(f"Status: {status.status}, Progress: {status.progress}")

Deploy on OpenShift

# Apply EvalHub manifests
oc apply -k config/openshift/

# Verify deployment
oc get pods -n eval-hub
oc logs -f deployment/eval-hub-server
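
Once the server is reachable, for example through an exposed route or a port-forward, point the SDK at it; the hostname below is illustrative:

from evalhub.client import EvalHubClient

# Replace the URL with your cluster's route or port-forwarded address.
client = EvalHubClient(base_url="https://eval-hub.apps.example.com")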

Build a Framework Adapter

from evalhub.adapter import FrameworkAdapter, JobSpec, JobResults, JobCallbacks
from evalhub.adapter.models import JobStatusUpdate, JobStatus, JobPhase, OCIArtifactSpec

class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self,
        config: JobSpec,
        callbacks: JobCallbacks
    ) -> JobResults:
        # Load your evaluation framework (placeholder for your framework's entry point)
        framework = load_your_framework()

        # Report progress
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.RUNNING_EVALUATION,
            progress=0.5,
            message="Running evaluation"
        ))

        # Run evaluation
        results = framework.evaluate(config.benchmark_id)

        # Persist artifacts
        oci_result = callbacks.create_oci_artifact(OCIArtifactSpec(
            files=["results.json", "report.html"],
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name
        ))

        # Return results
        return JobResults(
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name,
            results=results.evaluation_results,
            num_examples_evaluated=len(results.evaluation_results),
            duration_seconds=results.duration,
            oci_artifact=oci_result
        )
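
Before packaging the adapter into a container, it can be exercised locally with stubbed callbacks. This is a sketch under several assumptions: that JobCallbacks can be stubbed by subclassing, that JobSpec accepts the fields the adapter reads above (with a ModelConfig for its model field), and that MyAdapter needs no extra settings to instantiate:

from evalhub.models.api import ModelConfig

class PrintCallbacks(JobCallbacks):
    """Stub callbacks that log locally instead of calling the server or an OCI registry."""

    def report_status(self, update: JobStatusUpdate) -> None:
        print(f"[{update.phase}] {update.progress:.0%} {update.message}")

    def create_oci_artifact(self, spec: OCIArtifactSpec):
        print(f"(local run) skipping artifact push for {spec.files}")
        return None  # no artifact is produced in this smoke test

# Assumed JobSpec fields, mirrored from how the adapter reads them above.
spec = JobSpec(
    job_id="local-smoke-test",
    benchmark_id="my_benchmark",
    model=ModelConfig(url="http://localhost:8000/v1", name="my-model"),
)

results = MyAdapter().run_benchmark_job(spec, PrintCallbacks())
print(results.num_examples_evaluated, results.duration_seconds)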

Use Cases

Model Evaluation

Systematically evaluate LLMs across standardised benchmarks:

# Evaluate model on multiple benchmarks
client.submit_evaluation(
    model=ModelConfig(url="...", name="llama-3-8b"),
    benchmarks=[
        BenchmarkSpec(benchmark_id="mmlu", provider_id="lm_evaluation_harness"),
        BenchmarkSpec(benchmark_id="humaneval", provider_id="lm_evaluation_harness"),
        BenchmarkSpec(benchmark_id="gsm8k", provider_id="lm_evaluation_harness"),
    ]
)

Performance Testing

Benchmark inference server performance under load:

# Test server throughput and latency
client.submit_evaluation(
    model=ModelConfig(url="http://vllm-server:8000", name="llama-3-8b"),
    benchmarks=[
        BenchmarkSpec(
            benchmark_id="performance_test",
            provider_id="guidellm",
            config={
                "profile": "constant",
                "rate": 10,
                "max_seconds": 60
            }
        )
    ]
)

Collection-Based Evaluation

Run curated benchmark collections:

# Evaluate using predefined collection
client.submit_evaluation(
    model=ModelConfig(url="...", name="gpt-4"),
    collection_id="healthcare_safety_v1"  # Expands to multiple benchmarks
)
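
Available collection IDs can be discovered through the client's discovery calls. The method name list_collections is hypothetical, used here only for illustration; see the SDK reference for the actual discovery API:

# NOTE: list_collections() is a hypothetical method name for illustration;
# the client SDK exposes discovery of providers, benchmarks, and collections.
for collection in client.list_collections():
    print(collection)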

License

Apache 2.0 - see LICENSE for details.