
Performance Metrics

GuideLLM collects comprehensive performance metrics for LLM inference evaluation.

Requests Per Second

Definition: Number of successful requests processed per second

Use case: Overall system throughput measurement

Typical values:

  • Small models (1-3B): 10-50 req/s
  • Medium models (7-13B): 3-15 req/s
  • Large models (30B+): 1-5 req/s

Extracted metric: requests_per_second
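
For intuition, requests per second falls directly out of per-request timing records: count the successes and divide by the wall-clock span of the run. A minimal sketch in Python; the record structure here is hypothetical, not GuideLLM's internal schema:

records = [
    {"start": 0.00, "end": 0.21, "ok": True},
    {"start": 0.10, "end": 0.35, "ok": True},
    {"start": 0.20, "end": 0.48, "ok": False},  # failed request: excluded from the count
]

successful = [r for r in records if r["ok"]]
duration_s = max(r["end"] for r in records) - min(r["start"] for r in records)
requests_per_second = len(successful) / duration_s
print(f"{requests_per_second:.2f} req/s")  # 2 successful requests over 0.48 s ≈ 4.17 req/s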


Time to First Token (TTFT)

Definition: Latency from request submission until the first generated token

Use case: User experience - measures perceived responsiveness

Typical values:

  • Fast: < 100ms
  • Good: 100-500ms
  • Slow: > 1000ms

Extracted metric: mean_ttft_ms

Important for: Interactive applications, chatbots, real-time systems
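
One way to observe TTFT directly is to time a streaming request until the first chunk arrives. A minimal sketch against an OpenAI-compatible endpoint; the URL, model name, and payload are placeholder assumptions, not values from this benchmark:

import time
import requests

url = "http://localhost:8000/v1/completions"  # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 32, "stream": True}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # first non-empty SSE line carries the first token
            ttft_ms = (time.perf_counter() - start) * 1000
            break
print(f"TTFT: {ttft_ms:.1f} ms")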


Inter-Token Latency (ITL)

Definition: Time between consecutive generated tokens

Use case: Streaming quality - measures generation smoothness

Typical values:

  • Fast: < 20ms
  • Good: 20-50ms
  • Slow: > 100ms

Extracted metric: mean_itl_ms

Important for: Streaming responses, user experience
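
ITL is simply the gap between consecutive token arrivals, so both the mean and its spread come from first differences of the arrival timestamps. A minimal sketch over illustrative timestamps (seconds):

import statistics

arrivals = [0.120, 0.145, 0.171, 0.195, 0.222, 0.247]  # illustrative per-token arrival times

gaps_ms = [(b - a) * 1000 for a, b in zip(arrivals, arrivals[1:])]
print(f"mean ITL: {statistics.mean(gaps_ms):.1f} ms")    # ≈ 25.4 ms -> in the "good" range
print(f"ITL stdev: {statistics.stdev(gaps_ms):.2f} ms")  # low stdev -> smooth streaming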


Token Throughput

Definition: Tokens generated per second

Types:

  • Prompt tokens/sec: Input processing rate
  • Output tokens/sec: Generation rate
  • Total tokens/sec: Combined throughput

Extracted metrics:

  • prompt_tokens_per_second
  • output_tokens_per_second

Use case: Cost estimation, capacity planning
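
For capacity planning, throughput numbers convert directly into tokens served per hour. A minimal sketch with illustrative counts (not results from this benchmark):

prompt_tokens, output_tokens = 1000, 400  # illustrative totals for a run
duration_s = 3.8

prompt_tps = prompt_tokens / duration_s   # input processing rate
output_tps = output_tokens / duration_s   # generation rate
total_tps = prompt_tps + output_tps       # combined throughput

tokens_per_hour = total_tps * 3600        # capacity-planning figure
print(f"{total_tps:.1f} total tok/s ≈ {tokens_per_hour:,.0f} tokens/hour")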


Request Latency

Definition: End-to-end time from request to complete response

Calculation: TTFT + (num_tokens × ITL)

Use case: Overall performance measurement

Typical values:

  • Interactive: < 2s
  • Batch processing: 5-30s
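
Plugging representative numbers into the calculation above gives a quick sanity check; the values here are illustrative:

ttft_ms = 180.0   # time to first token
itl_ms = 35.0     # inter-token latency
num_tokens = 50   # generated tokens

latency_ms = ttft_ms + num_tokens * itl_ms
print(f"estimated request latency: {latency_ms / 1000:.2f} s")  # 1.93 s -> interactive (< 2s)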

Total Requests

Definition: Count of successful requests processed

Extracted metric: total_requests

Use case: Validation, sample size verification

All metrics include statistical measures:

  • Mean: Average value
  • Median: Middle value (50th percentile)
  • Standard Deviation: Variability measure
  • Percentiles: Distribution (p50, p75, p90, p95, p99)
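
These are standard summary statistics; a minimal sketch of computing them with NumPy over illustrative per-request TTFT samples:

import numpy as np

samples = np.array([110, 130, 125, 140, 510, 118, 122, 135, 128, 900])  # illustrative TTFT samples (ms)

print("mean  :", samples.mean())
print("median:", np.median(samples))    # 50th percentile
print("stddev:", samples.std(ddof=1))   # sample standard deviation
for p in (50, 75, 90, 95, 99):
    print(f"p{p}   :", np.percentile(samples, p))

Note how the two outliers pull the mean (≈ 232 ms) well above the median (≈ 129 ms); this is why the percentiles matter for latency metrics.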

The adapter extracts summary statistics from GuideLLM’s nested output structure:

{
  "framework": "guidellm",
  "benchmark_id": "performance_quick",
  "requests_per_second": 4.99,
  "prompt_tokens_per_second": 263.17,
  "output_tokens_per_second": 105.27,
  "mean_ttft_ms": 0.0,
  "mean_itl_ms": 0.0,
  "total_requests": 20,
  "benchmark_count": 1
}
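
A minimal sketch of that extraction step. The nested key layout below is an assumption for illustration only; consult your actual benchmarks.json for GuideLLM's real schema:

import json

# Hypothetical nested layout - NOT GuideLLM's documented schema.
nested = {
    "benchmarks": [
        {
            "id": "performance_quick",
            "metrics": {
                "requests_per_second": {"mean": 4.99},
                "time_to_first_token_ms": {"mean": 0.0},
            },
        }
    ]
}

bench = nested["benchmarks"][0]
summary = {
    "framework": "guidellm",
    "benchmark_id": bench["id"],
    "requests_per_second": bench["metrics"]["requests_per_second"]["mean"],
    "mean_ttft_ms": bench["metrics"]["time_to_first_token_ms"]["mean"],
    "benchmark_count": len(nested["benchmarks"]),
}
print(json.dumps(summary, indent=2))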

Signs of healthy performance:

  • ✓ Low TTFT (< 200ms) - Responsive feel
  • ✓ Consistent ITL (low std dev) - Smooth streaming
  • ✓ High token throughput - Efficient generation
  • ✓ Stable request rate - No saturation

Warning signs:

  • ⚠ High TTFT (> 1s) - Poor responsiveness
  • ⚠ Variable ITL (high std dev) - Stuttering generation
  • ⚠ Low throughput - Under-utilised resources
  • ⚠ Increasing latency - Approaching saturation
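
Those rules of thumb are easy to automate against the extracted summary. A minimal sketch, with thresholds taken from the guidance above:

def assess(summary: dict) -> list[str]:
    notes = []
    ttft = summary.get("mean_ttft_ms", 0.0)
    if 0 < ttft < 200:
        notes.append("✓ low TTFT - responsive feel")
    elif ttft > 1000:
        notes.append("⚠ high TTFT - poor responsiveness")
    itl = summary.get("mean_itl_ms", 0.0)
    if itl > 100:
        notes.append("⚠ slow ITL - stuttering generation")
    return notes

print(assess({"mean_ttft_ms": 150.0, "mean_itl_ms": 35.0}))  # ['✓ low TTFT - responsive feel']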

GuideLLM generates multiple output formats with detailed metrics:

JSON

Complete authoritative record with all metrics and sample requests.

File: benchmarks.json

Contains:

  • All statistical measures
  • Request-level data
  • System metadata
  • Configuration details

CSV

Tabular view for spreadsheets and BI tools.

File: benchmarks.csv

Columns: All metrics flattened with mean/median/std/percentiles
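
A minimal sketch of loading the CSV for ad-hoc analysis with pandas; the column names are illustrative assumptions, so inspect the real header first:

import pandas as pd

df = pd.read_csv("benchmarks.csv")
print(df.columns.tolist())  # inspect the flattened metric columns first
# Then select whichever percentile columns exist, e.g. (hypothetical names):
# print(df[["strategy", "ttft_ms_p95", "itl_ms_p95"]])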

HTML

Visual summary with latency distributions and interactive charts.

File: benchmarks.html

Includes:

  • Performance summary tables
  • Latency distribution graphs
  • Token throughput visualisations
  • Request timeline

YAML

Human-readable alternative to the JSON format.

File: benchmarks.yaml

Use case: Configuration review, documentation
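
Since the YAML file carries the same record as benchmarks.json, it loads with PyYAML in a couple of lines:

import yaml  # PyYAML

with open("benchmarks.yaml") as f:
    results = yaml.safe_load(f)
print(type(results))  # inspect the top-level structure before drilling further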

Here’s a sample benchmark result:

Token Metrics (Completed Requests)
┌───────────┬──────────────────┬──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Benchmark │ Prompt Tokens    │ Generated Tokens │ Total Tokens     │ Iterations       │                  │
│ Strategy  │ Per Request      │ Per Request      │ Per Request      │ Per Request      │                  │
│           │ Mdn / p95        │ Mdn / p95        │ Mdn / p95        │ Mdn / p95        │ Mdn / p95        │
├───────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┤
│ constant  │ 50.0 / 50.0      │ 20.0 / 20.0      │ 70.0 / 70.0      │ 1.0 / 1.0        │ 20.0 / 20.0      │
└───────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┘
Server Throughput Statistics
┌───────────┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Benchmark │ Requests      │ Input Tokens  │ Output Tokens │ Total Tokens  │
│ Strategy  │ Per Sec       │ Per Sec       │ Per Sec       │ Per Sec       │
│           │ Mdn / Mean    │ Mdn / Mean    │ Mdn / Mean    │ Mdn / Mean    │
├───────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ constant  │ 5.0 / 5.0     │ 250.3 / 263.2 │ 100.1 / 105.3 │ 350.5 / 368.4 │
└───────────┴───────────────┴───────────────┴───────────────┴───────────────┘

All metrics are:

  1. Sent to eval-hub service: Summary metrics for tracking and comparison
  2. Persisted as OCI artifacts: Complete results for detailed analysis
  3. Logged: Real-time visibility during benchmark execution