CLI

The evalhub command-line tool lets you submit evaluation jobs, check status, retrieve results, and manage collections and configuration — all from a terminal or shell script.

EvalHub is assumed to be running on your OpenShift cluster. If it is not, see Installation first.

Install the CLI

pip install "eval-hub-sdk[cli]"

Verify the install:

evalhub version
# evalhub 0.1.3

Configure a connection

Before running any commands, tell the CLI where your EvalHub server is. Connections are stored in named profiles at ~/.config/evalhub/config.yaml.

evalhub config set base_url https://evalhub.apps.my-cluster.example.com
evalhub config set token "$(kubectl create token <service-account> -n <namespace>)"
evalhub config set tenant my-team

Check the active profile:

evalhub config list
# Profile: default
#   base_url: https://evalhub.apps.my-cluster.example.com
#   token: sha256~...
#   tenant: my-team

If you work with multiple clusters, create named profiles:

evalhub config set base_url https://evalhub.staging.example.com --profile staging
evalhub config use staging

Switch back with evalhub config use default. Any command accepts --profile <name> to override at runtime without changing the active profile.
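
As a rough sketch of the --profile override, the helper below runs the same health check against several profiles without touching the active one. The function name check_profiles is hypothetical, and placing --profile after the subcommand is an assumption; evalhub health --help shows the exact form.

```shell
# Hypothetical helper: run a health check against each named profile.
# Assumes --profile may follow the subcommand ("evalhub health --profile staging").
check_profiles() (
  failed=0
  for profile in "$@"; do
    if evalhub health --profile "$profile" >/dev/null 2>&1; then
      echo "$profile: ok"
    else
      echo "$profile: unreachable"
      failed=1
    fi
  done
  # The body runs in a subshell, so exit sets the function's return status.
  exit "$failed"
)
```

Calling check_profiles default staging prints one line per profile and returns non-zero if any check fails.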

I want to see what providers and benchmarks are available

evalhub providers list
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID                    ┃ NAME                  ┃ DESCRIPTION                             ┃ BENCHMARKS ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ lm_evaluation_harness │ LM Evaluation Harness │ EleutherAI language model evaluation    │ 167        │
│ garak                 │ Garak                 │ LLM vulnerability and safety scanner    │ 12         │
│ guidellm              │ GuideLLM              │ Performance benchmarking                │ 4          │
└───────────────────────┴───────────────────────┴─────────────────────────────────────────┴────────────┘

To see what a provider offers:

evalhub providers describe lm_evaluation_harness
Provider: LM Evaluation Harness
ID:       lm_evaluation_harness
Description: EleutherAI language model evaluation framework

Benchmarks (167):
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ID            ┃ NAME                            ┃ CATEGORY            ┃ METRICS                             ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mmlu          │ Massive Multitask Language Und… │ knowledge           │ acc, acc_norm                       │
│ hellaswag     │ HellaSwag                       │ reasoning           │ acc, acc_norm                       │
│ gsm8k         │ Grade School Math 8K            │ math                │ exact_match                         │
│ arc_easy      │ ARC Easy                        │ reasoning           │ acc, acc_norm                       │
│ ...           │ ...                             │ ...                 │ ...                                 │
└───────────────┴─────────────────────────────────┴─────────────────────┴─────────────────────────────────────┘

I want to check the service is up

evalhub health
# EvalHub service: healthy (42ms)
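
In scripts, a health check is often used as a pre-flight gate. The sketch below retries evalhub health a fixed number of times. The name wait_for_evalhub and its defaults are hypothetical, and it assumes the command exits non-zero while the service is unhealthy or unreachable.

```shell
# Hypothetical helper: retry "evalhub health" until it succeeds or attempts run out.
# Assumes a non-zero exit code means the service is not ready yet.
wait_for_evalhub() (
  attempts=${1:-10}   # maximum number of checks
  delay=${2:-5}       # seconds to sleep between checks
  i=1
  while [ "$i" -le "$attempts" ]; do
    if evalhub health >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      exit 0
    fi
    # Sleep only if another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "service not healthy after $attempts attempts" >&2
  exit 1
)
```

For example, wait_for_evalhub 12 10 checks up to twelve times, ten seconds apart, before giving up.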

I want to run an evaluation

From a config file

Create a YAML file describing the job:

# eval.yaml
name: llama3-reasoning-eval
model:
  url: https://llama3.apps.my-cluster.example.com/v1
  name: meta-llama/Llama-3.2-8B-Instruct
benchmarks:
  - id: hellaswag
    provider_id: lm_evaluation_harness
  - id: gsm8k
    provider_id: lm_evaluation_harness

Submit it:

evalhub eval run --config eval.yaml
# Job submitted: eval-a1b2c3d4

From flags

evalhub eval run \
  --name llama3-mmlu \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --benchmark hellaswag
# Job submitted: eval-e5f6g7h8
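
When the benchmark list comes from a variable or another script, the repeated --benchmark flags can be generated rather than typed. This is a minimal sketch using only the flags shown above; the function name run_benchmarks and its argument order are hypothetical.

```shell
# Hypothetical helper: submit one job covering every benchmark passed as an argument.
# Usage: run_benchmarks <job-name> <model-url> <model-name> <provider> <benchmark>...
run_benchmarks() (
  name=$1 model_url=$2 model_name=$3 provider=$4
  shift 4
  # Rebuild "$@" so each benchmark is preceded by its own --benchmark flag.
  # The for list is expanded once, so modifying "$@" inside the loop is safe.
  for benchmark in "$@"; do
    shift
    set -- "$@" --benchmark "$benchmark"
  done
  evalhub eval run \
    --name "$name" \
    --model-url "$model_url" \
    --model-name "$model_name" \
    --provider "$provider" \
    "$@"
)
```

Rebuilding the positional parameters keeps the function POSIX-compatible and preserves benchmark IDs exactly as given.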

Non-blocking by default

eval run returns as soon as the job is accepted by the server, giving you the job ID to use later:

evalhub eval run --config eval.yaml
# Job submitted: eval-a1b2c3d4

You can check progress separately with evalhub eval status eval-a1b2c3d4 and retrieve results when it finishes.
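
To use the job ID in a later command, a script has to capture it from the submit output. The sketch below assumes the CLI prints a line of the form "Job submitted: eval-a1b2c3d4", as in the example; the function name submit_and_get_id is hypothetical, and parsing human-readable output can break if the message changes.

```shell
# Hypothetical helper: submit a job and print only its ID.
# Assumes a "Job submitted: <id>" line in the command's output.
submit_and_get_id() (
  output=$(evalhub eval run "$@") || exit $?
  job_id=$(printf '%s\n' "$output" |
    sed -n 's/.*Job submitted: \([A-Za-z0-9_-]*\).*/\1/p' |
    head -n 1)
  if [ -z "$job_id" ]; then
    echo "could not find a job ID in: $output" >&2
    exit 1
  fi
  printf '%s\n' "$job_id"
)
```

For example, job_id=$(submit_and_get_id --config eval.yaml) stores the ID for a later evalhub eval status "$job_id".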

Blocking until complete

Add --wait to block the shell until the job reaches a terminal state:

evalhub eval run --config eval.yaml --wait
# Job submitted: eval-a1b2c3d4
# Waiting for job eval-a1b2c3d4 to complete...
# Job eval-a1b2c3d4 finished with state: completed

The command exits with code 1 if the job fails, making it suitable for CI pipelines.
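
Because --wait turns a failed job into a non-zero exit code, a script can gate on several evaluations at once. A minimal sketch, with the hypothetical function name run_all_or_fail:

```shell
# Hypothetical helper: run each config file with --wait and fail if any job fails.
run_all_or_fail() (
  failures=0
  for config in "$@"; do
    if evalhub eval run --config "$config" --wait; then
      echo "PASS $config"
    else
      echo "FAIL $config"
      failures=$((failures + 1))
    fi
  done
  # Return success only when every job succeeded.
  [ "$failures" -eq 0 ]
)
```

Unlike stopping at the first error, this runs every config and reports all failures before returning.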

If the model endpoint requires authentication, see Model authentication.

I want to check what's running

List all jobs:

evalhub eval status
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ID                 ┃ NAME                    ┃ STATE       ┃ PROVIDER              ┃ BENCHMARKS ┃ CREATED                    ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ eval-a1b2c3d4      │ llama3-reasoning-eval   │ running     │ lm_evaluation_harness │ 2          │ 2026-03-25 10:00:00+00:00  │
│ eval-e5f6g7h8      │ llama3-mmlu             │ completed   │ lm_evaluation_harness │ 2          │ 2026-03-24 09:15:00+00:00  │
└────────────────────┴─────────────────────────┴─────────────┴───────────────────────┴────────────┴────────────────────────────┘

Filter by status:

evalhub eval status --status running
evalhub eval status --status failed

Inspect a single job:

evalhub eval status eval-a1b2c3d4
# Job:     eval-a1b2c3d4
# Name:    llama3-reasoning-eval
# State:   running
# Model:   meta-llama/Llama-3.2-8B-Instruct (https://llama3.apps.my-cluster.example.com/v1)
# Created: 2026-03-25 10:00:00+00:00

Watch a job until it finishes:

evalhub eval status eval-a1b2c3d4 --watch
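
Where a script needs an upper bound on waiting time, a polling loop is one alternative to --watch. The sketch below assumes the single-job view prints a "State: <value>" line as shown above, and that completed, failed, and cancelled are the terminal states; the function name wait_for_job and its defaults are hypothetical.

```shell
# Hypothetical helper: poll a job until it reaches a terminal state or polls run out.
# Returns 0 on completed, 1 on failed/cancelled, 2 on timeout.
wait_for_job() (
  job_id=$1
  max_polls=${2:-60}   # maximum number of status checks
  delay=${3:-10}       # seconds between checks
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Take the last field of the first line containing "State:".
    state=$(evalhub eval status "$job_id" | awk '/State:/ { print $NF; exit }')
    case $state in
      completed) exit 0 ;;
      failed|cancelled)
        echo "job $job_id ended in state: $state" >&2
        exit 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for job $job_id" >&2
  exit 2
)
```

Distinct return codes let the caller tell a failed job apart from one that is simply still running.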

I want to see the results

evalhub eval results eval-a1b2c3d4
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ BENCHMARK  ┃ PROVIDER              ┃ METRIC                ┃ VALUE   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ hellaswag  │ lm_evaluation_harness │ acc                   │ 0.7823  │
│ hellaswag  │ lm_evaluation_harness │ acc_norm              │ 0.8012  │
│ gsm8k      │ lm_evaluation_harness │ exact_match           │ 0.6540  │
└────────────┴───────────────────────┴───────────────────────┴─────────┘

Export for downstream processing:

# JSON
evalhub eval results eval-a1b2c3d4 --format json > results.json

# CSV
evalhub eval results eval-a1b2c3d4 --format csv > results.csv
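
As an illustration of downstream processing, the sketch below checks one metric against a threshold using the CSV export. It assumes a header row and the column order benchmark,provider,metric,value, mirroring the table view; the real export may differ, so adjust the field numbers accordingly. The function name check_metric is hypothetical.

```shell
# Hypothetical helper: fail when a metric in the CSV export is below a threshold.
# Assumed CSV layout: benchmark,provider,metric,value (with a header row).
check_metric() (
  job_id=$1 benchmark=$2 metric=$3 threshold=$4
  evalhub eval results "$job_id" --format csv |
    awk -F, -v b="$benchmark" -v m="$metric" -v t="$threshold" '
      NR > 1 && $1 == b && $3 == m { found = 1; value = $4 + 0 }
      END {
        if (!found) { print "metric not found"; exit 2 }
        if (value < t + 0) { printf "%s/%s %.4f < %.4f\n", b, m, value, t; exit 1 }
        printf "%s/%s %.4f >= %.4f\n", b, m, value, t
      }'
)
```

For example, check_metric eval-a1b2c3d4 hellaswag acc 0.75 returns 0 when the accuracy is at least 0.75, 1 when it is lower, and 2 when the metric is missing.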

I want to cancel a job

evalhub eval cancel eval-a1b2c3d4
# Are you sure you want to cancel this job? [y/N]: y
# Job eval-a1b2c3d4 cancelled.

To permanently remove it:

evalhub eval cancel eval-a1b2c3d4 --hard-delete

I want to work with collections

A collection is a named set of benchmarks that can be run together as a single job. Collections are stored on the server; the CLI lets you browse, run, create, and delete them.

List available collections

evalhub collections list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID                       ┃ NAME                     ┃ DESCRIPTION                    ┃ TAGS            ┃ BENCHMARKS ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ healthcare_safety_v1     │ Healthcare Safety v1     │ Medical domain benchmarks      │ safety, medical │ 8          │
│ finance_compliance_v1    │ Finance Compliance v1    │ Financial reasoning benchmarks │ finance         │ 5          │
│ general_llm_eval_v1      │ General LLM Eval v1      │ Broad capability evaluation    │ general         │ 12         │
└──────────────────────────┴──────────────────────────┴────────────────────────────────┴─────────────────┴────────────┘

Filter by tag:

evalhub collections list --tag safety

Inspect a collection

evalhub collections describe healthcare_safety_v1
# Collection: Healthcare Safety v1
# ID:          healthcare_safety_v1
# Description: Medical domain benchmarks
# Tags:        safety, medical
# Pass threshold: 0.75
#
# Benchmarks (8):
# ...

Run a collection

evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct
# Job submitted: eval-z9y8x7w6

Add --wait to block until all benchmarks complete:

evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --wait

Give the resulting job a meaningful name:

evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --name llama3-healthcare-2026-03-25

Create a collection

Define the collection in YAML:

# bias-collection.yaml
name: Bias and Fairness
description: Benchmarks for bias detection and fairness evaluation
tags:
  - safety
  - bias
benchmarks:
  - id: bbq
    provider_id: lm_evaluation_harness
    weight: 1.0
  - id: winogender
    provider_id: lm_evaluation_harness
    weight: 1.0
pass_criteria:
  threshold: 0.70

Create it on the server:

evalhub collections create --file bias-collection.yaml
# Collection created: bias-and-fairness-a1b2

Delete a collection

evalhub collections delete bias-and-fairness-a1b2
# Are you sure you want to delete this collection? [y/N]: y
# Collection bias-and-fairness-a1b2 deleted.

Pass --yes to skip the confirmation prompt in scripts.
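
A cleanup script might remove several temporary collections in one pass. A minimal sketch using --yes, with the hypothetical function name delete_collections:

```shell
# Hypothetical helper: delete several collections without prompting.
# Continues past failures and returns non-zero if any deletion failed.
delete_collections() (
  status=0
  for id in "$@"; do
    evalhub collections delete "$id" --yes || status=1
  done
  exit "$status"
)
```

Continuing after a failed deletion means one missing collection does not leave the rest behind.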

Using in CI/CD

The CLI is designed for scripted use. All commands return standard exit codes (0 on success, non-zero on failure), and every command that produces data supports --format json for machine-readable output.

GitHub Actions example

- name: Run safety evaluation
  env:
    EVALHUB_BASE_URL: ${{ secrets.EVALHUB_URL }}
    EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
  run: |
    pip install "eval-hub-sdk[cli]"

    evalhub eval run \
      --name "ci-eval-${{ github.sha }}" \
      --model-url "$MODEL_URL" \
      --model-name "$MODEL_NAME" \
      --provider lm_evaluation_harness \
      --benchmark mmlu \
      --wait

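- name: Export results
  env:
    EVALHUB_BASE_URL: ${{ secrets.EVALHUB_URL }}
    EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
  run: |
    # eval results needs a job ID. Look it up by the job name set above;
    # the .name and .id JSON fields are assumed to mirror the table columns.
    JOB_ID=$(evalhub eval status --format json \
      | jq -r '.[] | select(.name == "ci-eval-${{ github.sha }}") | .id')
    evalhub eval results "$JOB_ID" --format json > eval-results.json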

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: eval-results.json

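The CLI reads EVALHUB_BASE_URL and EVALHUB_TOKEN from the environment automatically, so no config file is needed in CI. Step-level env applies only to the step that declares it, so set both variables on every step that calls evalhub. MODEL_URL and MODEL_NAME are assumed to be defined elsewhere in the workflow.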

Output formats

All data-returning commands accept --format:

Format  Use case
table   Default; human-readable terminal view
json    Machine-readable; pipe to jq
yaml    Config-compatible output
csv     Spreadsheet import

Example:

evalhub eval status --format json | jq '.[].state'

ANSI colour codes are stripped automatically when stdout is not a TTY, so piped output is always clean.
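
A wrapper script can apply the same TTY test to choose a format. The sketch below is an illustration only: the function name evalhub_auto is hypothetical, and it assumes --format may be appended after the other arguments.

```shell
# Hypothetical helper: use a table for interactive runs and JSON when piped.
# "[ -t 1 ]" is true only when file descriptor 1 (stdout) is a terminal.
evalhub_auto() {
  if [ -t 1 ]; then
    evalhub "$@" --format table
  else
    evalhub "$@" --format json
  fi
}
```

With this wrapper, evalhub_auto eval status shows a table in a terminal, while evalhub_auto eval status | jq '.[].state' receives JSON.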

Command reference

Command                                   Description
evalhub version                           Print version
evalhub health                            Check EvalHub service health
eval
evalhub eval run                          Submit an evaluation job
evalhub eval status [job-id]              List jobs or inspect one
evalhub eval results <job-id>             Show evaluation results
evalhub eval cancel <job-id>              Cancel a job
collections
evalhub collections list                  List all collections
evalhub collections describe <id>         Show collection details
evalhub collections create --file <spec>  Create a collection
evalhub collections run <id>              Run a collection as a job
evalhub collections delete <id>           Delete a collection
providers
evalhub providers list                    List registered providers
evalhub providers describe <id>           Show provider details
config
evalhub config set <key> <value>          Set a config value
evalhub config get <key>                  Read a config value
evalhub config list                       Show active profile
evalhub config use <profile>              Switch profile

Every command supports --help for full flag details.

Next steps