Provider Catalog

Browse community-contributed evaluation providers from eval-hub-contrib. Each provider ships a provider.yaml that defines benchmarks, runtime settings, and container image references.

The catalog is synced from the contrib repository at site build time. Use the filters below to find a provider, then download either the raw provider.yaml or a ready-to-apply Kubernetes ConfigMap.

Select a provider to view details and download configuration.

IBM CLEAR

ibm-clear

IBM CLEAR evaluation framework for error analysis and reporting on agent traces

Runtime

quay.io/evalhub/community-ibm-clear:latest

Benchmarks (3)

ID	Name	Category
`agentic-evaluation`	Agentic Evaluation	error-analysis
`agentic-evaluation-custom-criteria`	Agentic evaluation (custom criteria)	error-analysis
`agentic-evaluation-predefined-issues`	Agentic evaluation (predefined issues)	error-analysis

Preview ConfigMap YAML

apiVersion: v1
kind: ConfigMap
metadata:
  name: evalhub-provider-ibm-clear
  namespace: opendatahub
  labels:
    trustyai.opendatahub.io/evalhub-provider-type: system
    trustyai.opendatahub.io/evalhub-provider-name: ibm-clear
data:
  ibm-clear.yaml: |
    # IBM CLEAR — provider definition (same shape as eval-hub config/providers/*.yaml).
    # Mount into eval-hub or sync via your operator’s provider sync process.
    #
    # Discovery: eval-hub may surface `parameters` in UI. The adapter validates required
    # inputs early in run_benchmark_job (`_validate_config`). Benchmark-specific keys are
    # enforced in `_validate_benchmark_contract` (see benchmarks below).
    # MLflow: set MLFLOW_TRACKING_URI and job experiment_name (or parameters.mlflow_experiment_name).

    id: ibm-clear
    name: IBM CLEAR
    description: IBM CLEAR evaluation framework for error analysis and reporting on agent traces
    type: builtin

    runtime:
      k8s:
        image: quay.io/evalhub/community-ibm-clear:latest
        image_pull_policy: Always
        entrypoint:
          - python
          - main.py
        cpu_request: 100m
        memory_request: 128Mi
        cpu_limit: 500m
        memory_limit: 1Gi
      local: null

    benchmarks:
      - id: agentic-evaluation
        name: Agentic Evaluation
        description: >
          Default agentic CLEAR run: clusters recurring failure patterns across agents using the
          standard agent-mode judge criteria (correctness, completeness, clarity, etc.). Use this
          when you want IBM CLEAR’s baseline “what went wrong and how often.”
        category: error-analysis
        metrics:
          - total_interactions
          - total_issues
          - interactions_with_issues
          - interactions_no_issues
          - total_agents
          - pct_interactions_with_issues
          - issues_per_interaction
          - average_score
        tags:
          - erroranalysis
          - ibmclear

      - id: agentic-evaluation-custom-criteria
        name: Agentic evaluation (custom criteria)
        description: >
          Caller supplies parameters.evaluation_criteria (criterion name -> description). Required for this benchmark id.
        category: error-analysis
        metrics:
          - total_interactions
          - total_issues
          - interactions_with_issues
          - interactions_no_issues
          - total_agents
          - pct_interactions_with_issues
          - issues_per_interaction
          - average_score
        tags:
          - ibmclear
          - custom-rubric

      - id: agentic-evaluation-predefined-issues
        name: Agentic evaluation (predefined issues)
        description: >
          Caller supplies parameters.predefined_issues (list of issue strings). CLEAR skips automatic issue discovery.
        category: error-analysis
        metrics:
          - total_interactions
          - total_issues
          - interactions_with_issues
          - interactions_no_issues
          - total_agents
          - pct_interactions_with_issues
          - issues_per_interaction
          - average_score
        tags:
          - ibmclear
          - predefined-issues

    parameters:
      - name: data_dir
        type: string
        description: >
          Directory of trace JSON files for CLEAR (preferred). Required unless Eval Hub mounts
          traces under /test_data or /data in the job pod, or you set traces_input_dir instead.

      - name: traces_input_dir
        type: string
        description: >
          Alternate name for the trace directory (same semantics as data_dir).

      - name: eval_model_name
        type: string
        description: >
          Required. CLEAR judge model id (e.g. openai/your-model) passed through to the agentic pipeline.

      - name: provider
        type: string
        description: >
          Required. LiteLLM/OpenAI-style provider name for the judge (e.g. openai, azure).

      - name: inference_backend
        type: string
        default: litellm
        description: >
          CLEAR inference backend. Use "endpoint" with model.url or parameters.endpoint_url (or inference_url)
          for a fixed HTTP endpoint; default litellm uses the usual CLEAR LiteLLM path.

      - name: endpoint_url
        type: string
        description: >
          When inference_backend is "endpoint", HTTP base URL for the judge (inference_url is accepted as an alias).

      - name: inference_url
        type: string
        description: >
          Same role as endpoint_url when inference_backend is "endpoint".

      - name: evaluation_criteria
        type: object
        description: >
          Required when benchmark_id is agentic-evaluation-custom-criteria: map of criterion name -> description.

      - name: predefined_issues
        type: array
        description: >
          Required when benchmark_id is agentic-evaluation-predefined-issues: non-empty list of issue strings.

      - name: mlflow_experiment_name
        type: string
        description: >
          Optional fallback experiment name for MLflow artifact upload if JobSpec.experiment_name is unset.

      - name: clear_dashboard_theme
        type: string
        default: red_hat
        description: >
          Dashboard theme for the generated static CLEAR HTML report.
          Default (omit / "red_hat" / "redhat"): apply the branded theme.
          Opt out ("clear", "default", "original", "ibm", "none", "false", "0", "off"): keep CLEAR's stock HTML.
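
The parameter descriptions above imply a few rules that can be checked before a job is submitted. The following is a minimal client-side sketch of those rules, assuming the job parameters arrive as a plain dictionary. The function name `check_clear_parameters` is hypothetical, and this is not the adapter's own `_validate_config` or `_validate_benchmark_contract` code. Treating an empty `evaluation_criteria` map as invalid is also an assumption.

```python
from __future__ import annotations

from typing import Any


def check_clear_parameters(benchmark_id: str, parameters: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means none were found.

    Hypothetical pre-flight check based on the parameter descriptions in the
    provider definition. The real adapter may enforce more or different rules.
    """
    problems: list[str] = []

    # eval_model_name and provider are described as required for every benchmark.
    for key in ("eval_model_name", "provider"):
        if not parameters.get(key):
            problems.append(f"missing required parameter: {key}")

    # The custom-criteria benchmark needs a map of criterion name -> description.
    # Rejecting an empty map is an assumption, not something the source states.
    if benchmark_id == "agentic-evaluation-custom-criteria":
        criteria = parameters.get("evaluation_criteria")
        if not isinstance(criteria, dict) or not criteria:
            problems.append("evaluation_criteria must be a non-empty map")

    # The predefined-issues benchmark needs a non-empty list of issue strings.
    if benchmark_id == "agentic-evaluation-predefined-issues":
        issues = parameters.get("predefined_issues")
        is_string_list = isinstance(issues, list) and all(isinstance(i, str) for i in issues)
        if not is_string_list or not issues:
            problems.append("predefined_issues must be a non-empty list of strings")

    return problems


if __name__ == "__main__":
    example = {"eval_model_name": "openai/your-model", "provider": "openai"}
    for line in check_clear_parameters("agentic-evaluation-custom-criteria", example):
        print(line)
```

Returning a list rather than raising on the first failure lets a caller report every missing input at once.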

MTEB

mteb

Massive Text Embedding Benchmark - comprehensive evaluation for text embedding models

Runtime

quay.io/evalhub/community-mteb:latest

Benchmarks (5)

ID	Name	Category
`mteb_sts`	Semantic Textual Similarity Suite	semantic_similarity
`mteb_retrieval`	Retrieval Suite	retrieval
`mteb_classification`	Classification Suite	classification
`mteb_clustering`	Clustering Suite	clustering
`mteb_reranking`	Reranking Suite	reranking

Preview ConfigMap YAML

apiVersion: v1
kind: ConfigMap
metadata:
  name: evalhub-provider-mteb
  namespace: opendatahub
  labels:
    trustyai.opendatahub.io/evalhub-provider-type: system
    trustyai.opendatahub.io/evalhub-provider-name: mteb
data:
  mteb.yaml: |
    # MTEB Provider Configuration
    # Massive Text Embedding Benchmark - comprehensive evaluation for text embedding models

    id: mteb
    name: MTEB
    description: Massive Text Embedding Benchmark - comprehensive evaluation for text embedding models
    type: builtin

    runtime:
      k8s:
        image: quay.io/evalhub/community-mteb:latest
        entrypoint:
          - python
          - main.py
        cpu_request: 100m
        memory_request: 128Mi
        cpu_limit: 1000m
        memory_limit: 2Gi
      local:
        # Reserved for local runtime configuration

    benchmarks:
      # Preset benchmark suites
      - id: mteb_sts
        name: Semantic Textual Similarity Suite
        description: STS benchmark suite covering STS12-17, STSBenchmark, and SICK-R
        category: semantic_similarity
        metrics:
          - main_score
          - cosine_spearman
          - cosine_pearson
        tags:
          - embedding
          - sts
          - similarity
          - mteb
          - suite

      - id: mteb_retrieval
        name: Retrieval Suite
        description: Information retrieval benchmark suite with NFCorpus, SciFact, ArguAna, TRECCOVID, and Touche2020
        category: retrieval
        metrics:
          - main_score
          - ndcg_at_10
          - map_at_10
        tags:
          - embedding
          - retrieval
          - search
          - mteb
          - suite

      - id: mteb_classification
        name: Classification Suite
        description: Text classification benchmark suite with AmazonReviewsClassification, Banking77Classification, and EmotionClassification
        category: classification
        metrics:
          - main_score
          - accuracy
        tags:
          - embedding
          - classification
          - mteb
          - suite

      - id: mteb_clustering
        name: Clustering Suite
        description: Document clustering benchmark suite with ArxivClusteringP2P, ArxivClusteringS2S, and BiorxivClusteringP2P
        category: clustering
        metrics:
          - main_score
          - v_measure
        tags:
          - embedding
          - clustering
          - mteb
          - suite

      - id: mteb_reranking
        name: Reranking Suite
        description: Passage reranking benchmark suite with AskUbuntuDupQuestions, MindSmallReranking, and SciDocsRR
        category: reranking
        metrics:
          - main_score
          - map
        tags:
          - embedding
          - reranking
          - mteb
          - suite

    parameters:
      # Common parameters for all benchmarks
      - name: batch_size
        type: integer
        default: 32
        description: Batch size for encoding

      - name: device
        type: string
        default: null
        description: Device override (cuda, cpu, mps, cuda:0)

      - name: languages
        type: array
        default:
          - eng
        description: Language codes to include (ISO 639-3)

      - name: verbosity
        type: integer
        default: 2
        description: MTEB verbosity level (0-3)

      - name: co2_tracker
        type: boolean
        default: false
        description: Enable CO2 emissions tracking

      - name: tasks
        type: array
        default: null
        description: Explicit list of MTEB task names (overrides benchmark preset)

      - name: task_types
        type: array
        default: null
        description: Filter by task type (STS, Retrieval, Classification, etc.)
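
The parameter list above says that `tasks` overrides the benchmark preset and that every other parameter has a default. The following sketch shows one way that resolution could work. The defaults are copied from the definition above, and the two preset task lists come from the benchmark descriptions. The names `resolve_mteb_job` and `PRESET_TASKS` are hypothetical, and the adapter's real preset mapping may differ.

```python
from __future__ import annotations

import copy
from typing import Any

# Defaults copied from the parameters list in the provider definition.
MTEB_DEFAULTS: dict[str, Any] = {
    "batch_size": 32,
    "device": None,
    "languages": ["eng"],
    "verbosity": 2,
    "co2_tracker": False,
    "tasks": None,
    "task_types": None,
}

# Hypothetical lookup table for two of the five suites, using the task names
# listed in the benchmark descriptions. Not taken from the adapter source.
PRESET_TASKS: dict[str, list[str]] = {
    "mteb_retrieval": ["NFCorpus", "SciFact", "ArguAna", "TRECCOVID", "Touche2020"],
    "mteb_reranking": ["AskUbuntuDupQuestions", "MindSmallReranking", "SciDocsRR"],
}


def resolve_mteb_job(benchmark_id: str, overrides: dict[str, Any] | None = None) -> dict[str, Any]:
    """Merge caller overrides onto the defaults and pick the task list."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(MTEB_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown parameter(s): {', '.join(unknown)}")

    # Deep copy so the shared default list for languages is never mutated.
    params = copy.deepcopy(MTEB_DEFAULTS)
    params.update(overrides)

    # An explicit tasks list wins; otherwise fall back to the benchmark preset.
    if params["tasks"] is None:
        if benchmark_id not in PRESET_TASKS:
            raise ValueError(f"no preset known for benchmark: {benchmark_id}")
        params["tasks"] = list(PRESET_TASKS[benchmark_id])
    return params


if __name__ == "__main__":
    print(resolve_mteb_job("mteb_retrieval", {"tasks": ["SciFact"], "batch_size": 8}))
```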

Applying a ConfigMap

After downloading a ConfigMap YAML, apply it to your cluster, replacing <id> with the provider ID (for example, ibm-clear or mteb):

oc apply -f evalhub-provider-<id>.yaml

For full OpenShift deployment steps, see OpenShift Setup.

The EvalHub operator discovers provider ConfigMaps by two labels. In the previews above, the type label is set to system and the name label to the provider ID:

trustyai.opendatahub.io/evalhub-provider-type
trustyai.opendatahub.io/evalhub-provider-name
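
The same two labels can be used to list provider ConfigMaps by hand. The sketch below builds a standard Kubernetes label selector string and prints an `oc get configmap` command that uses it. It illustrates how a user might query the labels, not how the operator itself performs discovery. The helper name `provider_selector` is hypothetical.

```python
from __future__ import annotations

TYPE_LABEL = "trustyai.opendatahub.io/evalhub-provider-type"
NAME_LABEL = "trustyai.opendatahub.io/evalhub-provider-name"


def provider_selector(provider_type: str | None = None, provider_name: str | None = None) -> str:
    """Build a Kubernetes label selector for provider ConfigMaps.

    A bare key matches any object that carries the label, and key=value
    matches an exact value. Comma-separated terms are combined with AND.
    """
    parts = [TYPE_LABEL if provider_type is None else f"{TYPE_LABEL}={provider_type}"]
    if provider_name is not None:
        parts.append(f"{NAME_LABEL}={provider_name}")
    return ",".join(parts)


if __name__ == "__main__":
    # Namespace and label values match the ConfigMap previews on this page.
    print("oc get configmap -n opendatahub -l " + provider_selector("system", "mteb"))
```

Calling `provider_selector()` with no arguments selects every ConfigMap that carries the type label, whatever its value.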

You can customize namespace and label values in the detail panel before downloading.
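
If you prefer to script the customization, the sketch below builds a ConfigMap with the same shape as the previews on this page from a raw provider.yaml string. It assumes only the Python standard library and emits JSON, which `oc apply -f` accepts alongside YAML. The function name `build_provider_configmap` and the `my-evalhub` namespace are hypothetical, and whether the operator watches a non-default namespace depends on your deployment.

```python
from __future__ import annotations

import json


def build_provider_configmap(
    provider_id: str,
    provider_yaml: str,
    namespace: str = "opendatahub",
    provider_type: str = "system",
) -> dict:
    """Wrap a provider.yaml string in a ConfigMap shaped like the previews above."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"evalhub-provider-{provider_id}",
            "namespace": namespace,
            "labels": {
                "trustyai.opendatahub.io/evalhub-provider-type": provider_type,
                "trustyai.opendatahub.io/evalhub-provider-name": provider_id,
            },
        },
        # The data key mirrors the previews: <provider id>.yaml
        "data": {f"{provider_id}.yaml": provider_yaml},
    }


# Shortened provider body for illustration; use the full provider.yaml in practice.
manifest = build_provider_configmap("mteb", "id: mteb\nname: MTEB\n", namespace="my-evalhub")

if __name__ == "__main__":
    # Save the output to a .json file and pass it to `oc apply -f`.
    print(json.dumps(manifest, indent=2))
```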