
Benchmarks

Benchmarks provide standardized performance measurements for AI models, functions, workflows, and system components. They help establish baselines, track improvements, and compare different implementations.

Overview

The Benchmarks collection offers:

  • Standardized performance testing
  • Historical performance tracking
  • Comparative analysis
  • Resource utilization metrics
  • Scalability assessments

Benchmark Structure

Each benchmark record contains detailed information about the test and its results:

```typescript
interface Benchmark {
  id: string
  name: string
  description?: string
  type: 'model' | 'function' | 'workflow' | 'system' | 'integration'
  target: {
    id: string
    type: string
    version?: string
  }
  parameters: Record<string, any>
  metrics: {
    latency?: {
      p50: number
      p90: number
      p95: number
      p99: number
      min: number
      max: number
      mean: number
    }
    throughput?: number
    errorRate?: number
    resourceUsage?: {
      cpu?: number
      memory?: number
      gpu?: number
      network?: number
      storage?: number
    }
    customMetrics?: Record<string, number>
  }
  environment: {
    hardware?: string
    os?: string
    runtime?: string
    dependencies?: Record<string, string>
  }
  timestamp: Date
  duration: number
  status: 'completed' | 'failed' | 'aborted'
  baseline?: {
    id: string
    metrics: Record<string, number>
    comparison: Record<string, number>
  }
  metadata?: Record<string, any>
}
```
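
For concreteness, here is a hypothetical record for a completed function benchmark. Every ID and value below is illustrative, and the units (milliseconds, requests per second) are assumptions, since the interface does not specify them:

```typescript
// Hypothetical Benchmark record; all IDs and values are illustrative
const example: Benchmark = {
  id: 'benchmark-123',
  name: 'text-classification latency',
  type: 'function',
  target: { id: 'function-123', type: 'text-classification', version: '1.2.0' },
  parameters: { iterations: 1000, concurrency: 10 },
  metrics: {
    latency: { p50: 42, p90: 61, p95: 70, p99: 95, min: 30, max: 140, mean: 45 }, // ms (assumed)
    throughput: 220, // requests per second (assumed)
    errorRate: 0.002,
  },
  environment: { runtime: 'node-20' },
  timestamp: new Date('2023-06-01T12:00:00Z'),
  duration: 4545, // total run time, assumed to be ms
  status: 'completed',
}
```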

Benchmark Types

Benchmarks are categorized into several types:

Model Benchmarks

Evaluate AI model performance:

  • Accuracy
  • Inference latency
  • Token throughput
  • Memory usage
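
The `Benchmarks.run` API shown later on this page can target models as well. A minimal sketch, assuming `run` resolves to a `Benchmark` record and treating the model-specific parameters (`maxTokens`, `batchSize`) as hypothetical:

```typescript
// Sketch: benchmark a model; maxTokens and batchSize are
// hypothetical parameters, not a confirmed API surface
const modelResult = await Benchmarks.run({
  type: 'model',
  target: { id: 'model-789', type: 'model' },
  parameters: { iterations: 500, maxTokens: 512, batchSize: 8 },
})

console.log(modelResult.metrics.latency?.p95) // inference latency
console.log(modelResult.metrics.customMetrics) // e.g. accuracy, token throughput
```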

Function Benchmarks

Measure function performance:

  • Execution time
  • Resource utilization
  • Scalability
  • Error rates

Workflow Benchmarks

Assess workflow efficiency:

  • End-to-end latency
  • Step timing
  • Success rates
  • Resource consumption

System Benchmarks

Evaluate overall system performance:

  • API response times
  • Concurrent request handling
  • Database operations
  • Queue processing

Working with Benchmarks

Running Benchmarks

Benchmarks can be executed on-demand or scheduled:

```typescript
// Example: Run a benchmark on a specific function
const result = await Benchmarks.run({
  type: 'function',
  target: {
    id: 'function-123',
    type: 'text-classification',
  },
  parameters: {
    iterations: 1000,
    concurrency: 10,
    inputSize: 'medium',
  },
})
```
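
Assuming `run` resolves to a `Benchmark` record as described above (the return shape is not documented here), the completed run can be inspected directly:

```typescript
// Sketch: inspect the result; units are assumptions
if (result.status === 'completed') {
  console.log(`p95 latency: ${result.metrics.latency?.p95} ms`)
  console.log(`throughput: ${result.metrics.throughput} req/s`)
  console.log(`error rate: ${result.metrics.errorRate}`)
}
```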

Comparing Benchmarks

Compare performance across different versions or implementations:

```typescript
// Example: Compare benchmark results
const comparison = await Benchmarks.compare({
  baseline: 'benchmark-123',
  current: 'benchmark-456',
  metrics: ['latency.p95', 'throughput', 'errorRate'],
})
```
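
The shape of the comparison result is not specified on this page. A minimal regression check, assuming `compare` returns per-metric percentage deltas keyed by metric path (an assumption):

```typescript
// Sketch: flag a regression; `deltas` and the 10% threshold are assumptions
const p95Delta = comparison.deltas?.['latency.p95'] ?? 0
if (p95Delta > 10) {
  console.warn(`p95 latency regressed by ${p95Delta.toFixed(1)}% vs baseline`)
}
```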

Tracking Performance Over Time

Monitor performance trends:

```typescript
// Example: Get historical performance data
const history = await Benchmarks.getHistory({
  target: {
    id: 'model-789',
    type: 'model',
  },
  metric: 'latency.p95',
  timeRange: {
    start: new Date('2023-01-01'),
    end: new Date('2023-12-31'),
  },
  interval: 'month',
})
```
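
Assuming `getHistory` resolves to an array of time-bucketed points (the return type is not documented here; `period` and `value` are assumed field names), the trend can be printed directly:

```typescript
// Sketch: print the monthly trend; the point shape is assumed
for (const point of history as Array<{ period: string; value: number }>) {
  console.log(`${point.period}: p95 = ${point.value} ms`)
}
```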

Best Practices

  1. Consistent Environment: Run benchmarks in consistent environments for valid comparisons
  2. Comprehensive Metrics: Measure multiple aspects of performance
  3. Statistical Significance: Ensure sufficient iterations for reliable results (see the sketch after this list)
  4. Realistic Workloads: Use representative data and workloads
  5. Regular Execution: Run benchmarks regularly to track performance over time
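
One way to act on point 3 is to repeat a run until the headline metric stabilizes. This is a minimal sketch, assuming the `Benchmarks.run` API from above; the 5% coefficient-of-variation stopping rule is an arbitrary illustrative choice:

```typescript
// Sketch: repeat runs until p95 latency stabilizes; the CV threshold
// and maxRuns are illustrative, not part of the product
async function runUntilStable(request: Parameters<typeof Benchmarks.run>[0], maxRuns = 10) {
  const samples: number[] = []
  for (let i = 0; i < maxRuns; i++) {
    const run = await Benchmarks.run(request)
    const p95 = run.metrics.latency?.p95
    if (p95 !== undefined) samples.push(p95)
    if (samples.length >= 3) {
      const mean = samples.reduce((a, b) => a + b, 0) / samples.length
      const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length
      if (Math.sqrt(variance) / mean < 0.05) break // stable enough
    }
  }
  return samples
}
```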

Integration with Other Observability Tools

Benchmarks integrate with other observability components:

  • Metrics: Benchmark results feed into performance metrics
  • Alerts: Performance regressions can trigger alerts
  • Dashboards: Visualize benchmark trends and comparisons
  • CI/CD: Integrate benchmarks into continuous integration pipelines (a sketch follows below)
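
As a sketch of the CI/CD point above, a pipeline step could run a benchmark, compare it against a stored baseline, and fail the build on a regression. The baseline ID, threshold, and `deltas` result shape are all assumptions for illustration:

```typescript
// Sketch: CI gate on a benchmark regression; baseline ID, threshold,
// and the compare() result shape are illustrative assumptions
async function ciGate() {
  const run = await Benchmarks.run({
    type: 'function',
    target: { id: 'function-123', type: 'text-classification' },
    parameters: { iterations: 1000, concurrency: 10 },
  })

  const comparison = await Benchmarks.compare({
    baseline: 'benchmark-123', // last known-good run
    current: run.id,
    metrics: ['latency.p95', 'errorRate'],
  })

  if ((comparison.deltas?.['latency.p95'] ?? 0) > 10) {
    console.error('Benchmark regression detected; failing the build.')
    process.exit(1)
  }
}
```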

API Reference

List Benchmarks

GET /api/benchmarks

Query parameters:

  • type: Filter by benchmark type
  • target.id: Filter by target ID
  • from: Start timestamp
  • to: End timestamp
  • limit: Maximum number of benchmarks to return
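
For example, listing recent function benchmarks over HTTP (the host and auth header are placeholders, not documented values):

```typescript
// Sketch: list recent function benchmarks; host and token are placeholders
const res = await fetch('https://example.com/api/benchmarks?type=function&limit=20', {
  headers: { Authorization: 'Bearer <token>' },
})
const benchmarks = await res.json()
```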

Get Benchmark by ID

GET /api/benchmarks/:id

Run Benchmark

POST /api/benchmarks/run

Request body:

{ "type": "function", "target": { "id": "function-123", "type": "text-classification" }, "parameters": { "iterations": 1000, "concurrency": 10 } }

Compare Benchmarks

POST /api/benchmarks/compare

Request body:

{ "baseline": "benchmark-123", "current": "benchmark-456", "metrics": ["latency.p95", "throughput"] }