# Performance Metrics
GuideLLM collects comprehensive performance metrics for LLM inference evaluation.
## Core Metrics

### Requests Per Second

Definition: Number of successful requests processed per second
Use case: Overall system throughput measurement
Typical values:
- Small models (1-3B): 10-50 req/s
- Medium models (7-13B): 3-15 req/s
- Large models (30B+): 1-5 req/s
Extracted metric: requests_per_second
### Time to First Token (TTFT)

Definition: Latency from request submission until the first generated token
Use case: User experience - measures perceived responsiveness
Typical values:
- Fast: < 100ms
- Good: 100-500ms
- Slow: > 1000ms
Extracted metric: mean_ttft_ms
Important for: Interactive applications, chatbots, real-time systems
### Inter-Token Latency (ITL)

Definition: Time between consecutive generated tokens
Use case: Streaming quality - measures generation smoothness
Typical values:
- Fast: < 20ms
- Good: 20-50ms
- Slow: > 100ms
Extracted metric: mean_itl_ms
Important for: Streaming responses, user experience
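Both latency metrics above can be derived from per-token arrival timestamps. A minimal sketch, with made-up timestamps rather than real GuideLLM data:

```python
# Illustrative sketch: deriving TTFT and mean ITL from per-token
# arrival timestamps. All values below are made up, not real output.

def ttft_ms(request_start: float, token_times: list[float]) -> float:
    """Time from request submission to the first generated token, in ms."""
    return (token_times[0] - request_start) * 1000

def mean_itl_ms(token_times: list[float]) -> float:
    """Average gap between consecutive generated tokens, in ms."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps) * 1000

start = 10.0                                 # request submitted at t = 10 s
arrivals = [10.150, 10.180, 10.210, 10.240]  # token arrival times (s)
print(ttft_ms(start, arrivals))   # ~150 ms: within the "good" TTFT range
print(mean_itl_ms(arrivals))      # ~30 ms: within the "good" ITL range
```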
### Token Throughput

Definition: Tokens generated per second
Types:
- Prompt tokens/sec: Input processing rate
- Output tokens/sec: Generation rate
- Total tokens/sec: Combined throughput
Extracted metrics:
- prompt_tokens_per_second
- output_tokens_per_second
Use case: Cost estimation, capacity planning
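Each rate is simply a token count divided by the wall-clock duration of the run. A sketch with illustrative numbers (the request counts and duration are placeholders):

```python
# Illustrative sketch: token throughput as tokens / wall-clock seconds.
# Counts and duration below are made-up placeholders.

def tokens_per_second(total_tokens: int, duration_s: float) -> float:
    return total_tokens / duration_s

duration = 4.0                                      # benchmark wall-clock time (s)
prompt_tps = tokens_per_second(20 * 50, duration)   # 20 requests x 50 prompt tokens
output_tps = tokens_per_second(20 * 20, duration)   # 20 requests x 20 output tokens
total_tps = prompt_tps + output_tps

print(prompt_tps)   # 250.0 prompt tokens/sec
print(output_tps)   # 100.0 output tokens/sec
print(total_tps)    # 350.0 total tokens/sec
```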
### Request Latency

Definition: End-to-end time from request to complete response
Calculation: TTFT + (num_tokens × ITL)
Use case: Overall performance measurement
Typical values:
- Interactive: < 2s
- Batch processing: 5-30s
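Applying the calculation above to the typical values from the earlier sections gives a quick feel for the formula. A minimal sketch:

```python
# Illustrative sketch of the end-to-end latency estimate used above:
# request latency ~= TTFT + (num_tokens x ITL).

def estimated_latency_ms(ttft_ms: float, num_tokens: int, itl_ms: float) -> float:
    return ttft_ms + num_tokens * itl_ms

# 150 ms TTFT and 30 ms ITL (both in the "good" range), 20 generated tokens:
print(estimated_latency_ms(150.0, 20, 30.0))  # 750.0 ms, well under the 2 s interactive target
```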
### Total Requests

Definition: Count of successful requests processed
Extracted metric: total_requests
Use case: Validation, sample size verification
## Statistical Measures

Every metric is reported with the following statistical measures:
- Mean: Average value
- Median: Middle value (50th percentile)
- Standard Deviation: Variability measure
- Percentiles: Distribution (p50, p75, p90, p95, p99)
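The measures above can all be computed with Python's standard library; the latency sample below is illustrative, not real benchmark data:

```python
import statistics

# Illustrative sketch: the statistical measures listed above, computed
# for a made-up sample of per-request latencies (ms).
samples = [12.1, 15.3, 14.8, 13.0, 16.2, 14.1, 13.7, 15.9, 14.4, 13.2]

mean = statistics.mean(samples)      # average value
median = statistics.median(samples)  # middle value (50th percentile)
stdev = statistics.stdev(samples)    # sample standard deviation

# quantiles(n=100) returns the 1st..99th percentiles; index p-1 for pXX.
pct = statistics.quantiles(samples, n=100)
p50, p75, p90, p95, p99 = pct[49], pct[74], pct[89], pct[94], pct[98]
print(mean, median, p95)
```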
## Metric Extraction

The adapter extracts summary statistics from GuideLLM’s nested output structure:
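A hedged sketch of what this flattening step could look like, assuming a simplified nested schema; the real GuideLLM output format and the adapter's actual code may differ:

```python
# Hypothetical sketch of the flattening step. The nested schema below is
# a simplified assumption, not the exact GuideLLM output format.

def extract_summary(raw: dict) -> dict:
    benchmarks = raw.get("benchmarks", [])
    metrics = benchmarks[0].get("metrics", {}) if benchmarks else {}

    def mean_of(name: str) -> float:
        # Pull the mean out of a per-metric statistics object.
        return metrics.get(name, {}).get("mean", 0.0)

    return {
        "framework": "guidellm",
        "requests_per_second": mean_of("requests_per_second"),
        "output_tokens_per_second": mean_of("output_tokens_per_second"),
        "mean_ttft_ms": mean_of("time_to_first_token_ms"),
        "mean_itl_ms": mean_of("inter_token_latency_ms"),
        "benchmark_count": len(benchmarks),
    }

raw = {"benchmarks": [{"metrics": {
    "requests_per_second": {"mean": 4.99, "median": 5.0},
    "output_tokens_per_second": {"mean": 105.27},
    "time_to_first_token_ms": {"mean": 42.0},
    "inter_token_latency_ms": {"mean": 21.5},
}}]}
print(extract_summary(raw))
```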
```json
{
  "framework": "guidellm",
  "benchmark_id": "performance_quick",
  "requests_per_second": 4.99,
  "prompt_tokens_per_second": 263.17,
  "output_tokens_per_second": 105.27,
  "mean_ttft_ms": 0.0,
  "mean_itl_ms": 0.0,
  "total_requests": 20,
  "benchmark_count": 1
}
```

## Interpreting Results
### Good Performance Indicators
Section titled “Good Performance Indicators”✓ Low TTFT (< 200ms) - Responsive feel ✓ Consistent ITL (low std dev) - Smooth streaming ✓ High token throughput - Efficient generation ✓ Stable request rate - No saturation
### Performance Issues

- ⚠ High TTFT (> 1s): poor responsiveness
- ⚠ Variable ITL (high std dev): stuttering generation
- ⚠ Low throughput: under-utilised resources
- ⚠ Increasing latency: approaching saturation
## Benchmark Output

GuideLLM generates multiple output formats with detailed metrics:

### JSON Output

Complete authoritative record with all metrics and sample requests.
File: benchmarks.json
Contains:
- All statistical measures
- Request-level data
- System metadata
- Configuration details
### CSV Output

Tabular view for spreadsheets and BI tools.
File: benchmarks.csv
Columns: All metrics flattened with mean/median/std/percentiles
### HTML Output

Visual summary with latency distributions and interactive charts.
File: benchmarks.html
Includes:
- Performance summary tables
- Latency distribution graphs
- Token throughput visualisations
- Request timeline
### YAML Output

Human-readable alternative to the JSON output.
File: benchmarks.yaml
Use case: Configuration review, documentation
## Example Output

Here’s a sample benchmark result:
Token Metrics (Completed Requests):

| Benchmark Strategy | Prompt Tokens Per Request (Mdn / p95) | Generated Tokens Per Request (Mdn / p95) | Total Tokens Per Request (Mdn / p95) | Iterations Per Request (Mdn / p95) | (Mdn / p95) |
| --- | --- | --- | --- | --- | --- |
| constant | 50.0 / 50.0 | 20.0 / 20.0 | 70.0 / 70.0 | 1.0 / 1.0 | 20.0 / 20.0 |
Server Throughput Statistics:

| Benchmark Strategy | Requests Per Sec (Mdn / Mean) | Input Tokens Per Sec (Mdn / Mean) | Output Tokens Per Sec (Mdn / Mean) | Total Tokens Per Sec (Mdn / Mean) |
| --- | --- | --- | --- | --- |
| constant | 5.0 / 5.0 | 250.3 / 263.2 | 100.1 / 105.3 | 350.5 / 368.4 |

## Metric Persistence

All metrics are:
- Sent to eval-hub service: Summary metrics for tracking and comparison
- Persisted as OCI artifacts: Complete results for detailed analysis
- Logged: Real-time visibility during benchmark execution