# CLI
The evalhub command-line tool lets you submit evaluation jobs, check status, retrieve results, and manage collections and configuration — all from a terminal or shell script.
EvalHub is assumed to be running on your OpenShift cluster. If it is not, see Installation first.
## Install the CLI
```sh
pip install "eval-hub-sdk[cli]"
```

Verify the install:
```sh
evalhub version
# evalhub 0.1.3
```

## Configure a connection
Section titled “Configure a connection”Before running any commands, tell the CLI where your EvalHub server is. Connections are stored in named profiles at ~/.config/evalhub/config.yaml.
```sh
evalhub config set base_url https://evalhub.apps.my-cluster.example.com
evalhub config set token $(kubectl create token <service-account> -n <namespace>)
evalhub config set tenant my-team
```

Check the active profile:
```sh
evalhub config list
# Profile: default
# base_url: https://evalhub.apps.my-cluster.example.com
# token: sha256~...
# tenant: my-team
```

If you work with multiple clusters, create named profiles:
```sh
evalhub config set base_url https://evalhub.staging.example.com
evalhub config use staging
```

Switch back with `evalhub config use default`. Any command accepts `--profile <name>` to override at runtime without changing the active profile.
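For reference, the profile store is plain YAML. The layout below is a sketch, assuming one top-level map of profiles; the actual key names may differ, so inspect your own file to confirm:

```yaml
# ~/.config/evalhub/config.yaml (illustrative layout, not a documented schema)
active_profile: default          # assumed key name
profiles:                        # assumed key name
  default:
    base_url: https://evalhub.apps.my-cluster.example.com
    token: sha256~...
    tenant: my-team
  staging:
    base_url: https://evalhub.staging.example.com
```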
## I want to see what providers and benchmarks are available
```
evalhub providers list
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID                    ┃ NAME                  ┃ DESCRIPTION                          ┃ BENCHMARKS ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ lm_evaluation_harness │ LM Evaluation Harness │ EleutherAI language model evaluation │        167 │
│ garak                 │ Garak                 │ LLM vulnerability and safety scanner │         12 │
│ guidellm              │ GuideLLM              │ Performance benchmarking             │          4 │
└───────────────────────┴───────────────────────┴──────────────────────────────────────┴────────────┘
```

To see what a provider offers:
```
evalhub providers describe lm_evaluation_harness
Provider: LM Evaluation Harness
ID: lm_evaluation_harness
Description: EleutherAI language model evaluation framework

Benchmarks (167):
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ ID        ┃ NAME                            ┃ CATEGORY  ┃ METRICS       ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ mmlu      │ Massive Multitask Language Und… │ knowledge │ acc, acc_norm │
│ hellaswag │ HellaSwag                       │ reasoning │ acc, acc_norm │
│ gsm8k     │ Grade School Math 8K            │ math      │ exact_match   │
│ arc_easy  │ ARC Easy                        │ reasoning │ acc, acc_norm │
│ ...       │ ...                             │ ...       │ ...           │
└───────────┴─────────────────────────────────┴───────────┴───────────────┘
```
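With 167 benchmarks, scanning the table gets unwieldy. Data-returning commands support `--format json` (see Output formats below), so you can filter with `jq` instead. The `.benchmarks[].id` path is an assumption about the payload shape, not a documented schema; check the real structure with `jq .` first:

```sh
# list benchmark IDs matching "math"; .benchmarks[].id is an assumed path
evalhub providers describe lm_evaluation_harness --format json \
  | jq -r '.benchmarks[].id' | grep -i math
```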
## I want to check the service is up

```sh
evalhub health
# EvalHub service: healthy (42ms)
```

## I want to run an evaluation
### From a config file
Create a YAML file describing the job:
```yaml
name: llama3-reasoning-eval
model:
  url: https://llama3.apps.my-cluster.example.com/v1
  name: meta-llama/Llama-3.2-8B-Instruct
benchmarks:
  - id: hellaswag
    provider_id: lm_evaluation_harness
  - id: gsm8k
    provider_id: lm_evaluation_harness
```

Submit it:
```sh
evalhub eval run --config eval.yaml
# Job submitted: eval-a1b2c3d4
```

### From flags
```sh
evalhub eval run \
  --name llama3-mmlu \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --provider lm_evaluation_harness \
  --benchmark mmlu \
  --benchmark hellaswag
# Job submitted: eval-e5f6g7h8
```

### Config file vs. inline flags
Do not mix the two styles on the same command. When you pass `--config`, the YAML or JSON file is the complete job specification for `eval run`. All other job-related flags on that same command (for example `--name`, `--model-url`, `--provider`, `--benchmark`, `--experiment`, `--metric`, and `--description`) are not applied; anything that belongs in the job request must live in the file (see the quick start for full examples, including MLflow tracking).

You can still use `--wait`, `--timeout`, `--poll-interval`, and `--format` with `--config`, because they only control how the CLI waits for or prints the result, not the job body sent to the server.
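A minimal illustration of the split (the extra `--name` below is hypothetical and has no effect):

```sh
# NOT applied: --config carries the complete job spec, so --name is ignored
evalhub eval run --config eval.yaml --name this-name-is-ignored

# Fine: --wait and --format only shape how the CLI waits and prints
evalhub eval run --config eval.yaml --wait --format json
```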
### Non-blocking by default
`eval run` returns as soon as the job is accepted by the server, giving you the job ID to use later:
```sh
evalhub eval run --config eval.yaml
# Job submitted: eval-a1b2c3d4
```

You can check progress separately with `evalhub eval status eval-a1b2c3d4` and retrieve results when it finishes.
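In scripts it is handier to capture the job ID than to parse the human-readable line. The `jq` path below matches the GitHub Actions example later on this page; whether the response has the same shape without `--wait` is an assumption worth verifying:

```sh
# submit, keep the ID, and poll later
JOB_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].resource.id')
evalhub eval status "$JOB_ID"
```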
### Blocking until complete
Add `--wait` to block the shell until the job reaches a terminal state:
```sh
evalhub eval run --config eval.yaml --wait
# Job submitted: eval-a1b2c3d4
# Waiting for job eval-a1b2c3d4 to complete...
# Job eval-a1b2c3d4 finished with state: completed
```

The command exits with code 1 if the job fails, making it suitable for CI pipelines.
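That means a release script can gate directly on the command:

```sh
# abort the script unless the evaluation completes successfully
evalhub eval run --config eval.yaml --wait || {
  echo "evaluation failed" >&2
  exit 1
}
```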
If the model endpoint requires authentication, see Model authentication.
## I want to check what’s running
List all jobs:
```
evalhub eval status
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ID            ┃ NAME                  ┃ STATE     ┃ PROVIDER              ┃ BENCHMARKS ┃ CREATED                   ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ eval-a1b2c3d4 │ llama3-reasoning-eval │ running   │ lm_evaluation_harness │          2 │ 2026-03-25 10:00:00+00:00 │
│ eval-e5f6g7h8 │ llama3-mmlu           │ completed │ lm_evaluation_harness │          2 │ 2026-03-24 09:15:00+00:00 │
└───────────────┴───────────────────────┴───────────┴───────────────────────┴────────────┴───────────────────────────┘
```

Filter by status:
```sh
evalhub eval status --status running
evalhub eval status --status failed
```
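These filters combine with `--format json` for scripting. A sketch, assuming the output is a JSON array of jobs (the `jq '.[].state'` example under Output formats suggests it is):

```sh
# count failed jobs; anything non-zero needs attention
FAILED=$(evalhub eval status --status failed --format json | jq 'length')
[ "$FAILED" -eq 0 ] || echo "$FAILED failed job(s)" >&2
```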
Inspect a single job:

```sh
evalhub eval status eval-a1b2c3d4
# Job: eval-a1b2c3d4
# Name: llama3-reasoning-eval
# State: running
# Model: meta-llama/Llama-3.2-8B-Instruct (https://llama3.apps.my-cluster.example.com/v1)
# Created: 2026-03-25 10:00:00+00:00
```

Watch a job until it finishes:
```sh
evalhub eval status eval-a1b2c3d4 --watch
```

## I want to see the results
```
evalhub eval results eval-a1b2c3d4
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┓
┃ BENCHMARK ┃ PROVIDER              ┃ METRIC      ┃ VALUE  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━┩
│ hellaswag │ lm_evaluation_harness │ acc         │ 0.7823 │
│ hellaswag │ lm_evaluation_harness │ acc_norm    │ 0.8012 │
│ gsm8k     │ lm_evaluation_harness │ exact_match │ 0.6540 │
└───────────┴───────────────────────┴─────────────┴────────┘
```

Export for downstream processing:
```sh
# JSON
evalhub eval results eval-a1b2c3d4 --format json > results.json

# CSV
evalhub eval results eval-a1b2c3d4 --format csv > results.csv
```
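To gate on a single score, filter the JSON export. The field names here (`benchmark`, `metric`, `value`) are guesses based on the table columns above, not a documented schema; inspect the payload with `jq .` before relying on them:

```sh
# pull one metric out of the export; field names are assumed
jq -r '.[] | select(.benchmark == "gsm8k" and .metric == "exact_match") | .value' results.json
```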
## I want to cancel a job

```sh
evalhub eval cancel eval-a1b2c3d4
# Are you sure you want to cancel this job? [y/N]: y
# Job eval-a1b2c3d4 cancelled.
```

To permanently remove it:
```sh
evalhub eval cancel eval-a1b2c3d4 --hard-delete
```

## I want to work with collections
A collection is a named set of benchmarks that can be run together as a single job. Collections are defined on the server; the CLI lets you browse and run them.
### List available collections
```
evalhub collections list
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID                    ┃ NAME                  ┃ DESCRIPTION                    ┃ TAGS            ┃ BENCHMARKS ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ healthcare_safety_v1  │ Healthcare Safety v1  │ Medical domain benchmarks      │ safety, medical │          8 │
│ finance_compliance_v1 │ Finance Compliance v1 │ Financial reasoning benchmarks │ finance         │          5 │
│ general_llm_eval_v1   │ General LLM Eval v1   │ Broad capability evaluation    │ general         │         12 │
└───────────────────────┴───────────────────────┴────────────────────────────────┴─────────────────┴────────────┘
```

Filter by tag:
```sh
evalhub collections list --tag safety
```

### Inspect a collection
```sh
evalhub collections describe healthcare_safety_v1
# Collection: Healthcare Safety v1
# ID: healthcare_safety_v1
# Description: Medical domain benchmarks
# Tags: safety, medical
# Pass threshold: 0.75
#
# Benchmarks (8):
# ...
```

### Run a collection
```sh
evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct
# Job submitted: eval-z9y8x7w6
```

Add `--wait` to block until all benchmarks complete:
```sh
evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --wait
```

Give the resulting job a meaningful name:
```sh
evalhub collections run healthcare_safety_v1 \
  --model-url https://llama3.apps.my-cluster.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --name llama3-healthcare-2026-03-25
```

### Create a collection
Define the collection in YAML:
```yaml
name: Bias and Fairness
description: Benchmarks for bias detection and fairness evaluation
tags:
  - safety
  - bias
benchmarks:
  - id: bbq
    provider_id: lm_evaluation_harness
    weight: 1.0
  - id: winogender
    provider_id: lm_evaluation_harness
    weight: 1.0
pass_criteria:
  threshold: 0.70
```

```sh
evalhub collections create --file bias-collection.yaml
# Collection created: bias-and-fairness-a1b2
```

### Delete a collection
```sh
evalhub collections delete bias-and-fairness-a1b2
# Are you sure you want to delete this collection? [y/N]: y
# Collection bias-and-fairness-a1b2 deleted.
```

Pass `--yes` to skip the confirmation prompt in scripts.
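For example, tearing down a throwaway collection in a script:

```sh
# non-interactive deletion; --yes suppresses the [y/N] prompt
evalhub collections delete bias-and-fairness-a1b2 --yes
```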
## Using in CI/CD
The CLI is designed for scripted use. All commands return standard exit codes (0 on success, non-zero on failure), and every command that produces data supports `--format json` for machine-readable output.
### GitHub Actions example
```yaml
- name: Run safety evaluation
  env:
    EVALHUB_BASE_URL: ${{ secrets.EVALHUB_URL }}
    EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
  run: |
    pip install "eval-hub-sdk[cli]"

    JOB_ID=$(evalhub eval run \
      --name "ci-eval-${{ github.sha }}" \
      --model-url "$MODEL_URL" \
      --model-name "$MODEL_NAME" \
      --provider lm_evaluation_harness \
      --benchmark mmlu \
      --wait \
      --format json | jq -r '.[0].resource.id')
    echo "JOB_ID=$JOB_ID" >> "$GITHUB_ENV"

- name: Export results
  run: |
    evalhub eval results "$JOB_ID" --format json > eval-results.json

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: eval-results.json
```

`EVALHUB_BASE_URL` and `EVALHUB_TOKEN` are read from environment variables automatically; no config file is needed in CI.
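The same environment variables work in any scripted context, for example a quick smoke test from a local shell:

```sh
# no config file needed; the CLI reads these automatically
export EVALHUB_BASE_URL=https://evalhub.apps.my-cluster.example.com
export EVALHUB_TOKEN=$(kubectl create token <service-account> -n <namespace>)
evalhub health
```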
## Output formats
All data-returning commands accept `--format`:
| Format | Use case |
|---|---|
| `table` | Default; human-readable terminal view |
| `json` | Machine-readable; pipe to `jq` |
| `yaml` | Config-compatible output |
| `csv` | Spreadsheet import |
Example:
```sh
evalhub eval status --format json | jq '.[].state'
```

ANSI colour codes are stripped automatically when stdout is not a TTY, so piped output is always clean.
## Command reference
Section titled “Command reference”| Command | Description |
|---|---|
evalhub version | Print version |
evalhub health | Check EvalHub service health |
| eval | |
evalhub eval run | Submit an evaluation job |
evalhub eval status [job-id] | List jobs or inspect one |
evalhub eval results <job-id> | Show evaluation results |
evalhub eval cancel <job-id> | Cancel a job |
| collections | |
evalhub collections list | List all collections |
evalhub collections describe <id> | Show collection details |
evalhub collections create --file <spec> | Create a collection |
evalhub collections run <id> | Run a collection as a job |
evalhub collections delete <id> | Delete a collection |
| providers | |
evalhub providers list | List registered providers |
evalhub providers describe <id> | Show provider details |
| config | |
evalhub config set <key> <value> | Set a config value |
evalhub config get <key> | Read a config value |
evalhub config list | Show active profile |
evalhub config use <profile> | Switch profile |
Every command supports --help for full flag details.