Benchmarks
Benchmarks provide standardized performance measurements for AI models, functions, workflows, and system components. They help establish baselines, track improvements, and compare different implementations.
Overview
The Benchmarks collection offers:
- Standardized performance testing
- Historical performance tracking
- Comparative analysis
- Resource utilization metrics
- Scalability assessments
Benchmark Structure
Each benchmark record contains detailed information about the test and its results:
```typescript
interface Benchmark {
  id: string
  name: string
  description?: string
  type: 'model' | 'function' | 'workflow' | 'system' | 'integration'
  target: {
    id: string
    type: string
    version?: string
  }
  parameters: Record<string, any>
  metrics: {
    latency?: {
      p50: number
      p90: number
      p95: number
      p99: number
      min: number
      max: number
      mean: number
    }
    throughput?: number
    errorRate?: number
    resourceUsage?: {
      cpu?: number
      memory?: number
      gpu?: number
      network?: number
      storage?: number
    }
    customMetrics?: Record<string, number>
  }
  environment: {
    hardware?: string
    os?: string
    runtime?: string
    dependencies?: Record<string, string>
  }
  timestamp: Date
  duration: number
  status: 'completed' | 'failed' | 'aborted'
  baseline?: {
    id: string
    metrics: Record<string, number>
    comparison: Record<string, number>
  }
  metadata?: Record<string, any>
}
```
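For illustration, a completed record for a function benchmark might look like this (all values are hypothetical):

```typescript
// Hypothetical benchmark record conforming to the interface above
const example: Benchmark = {
  id: 'benchmark-123',
  name: 'text-classification throughput',
  type: 'function',
  target: { id: 'function-123', type: 'text-classification', version: '1.2.0' },
  parameters: { iterations: 1000, concurrency: 10 },
  metrics: {
    latency: { p50: 42, p90: 80, p95: 95, p99: 140, min: 18, max: 210, mean: 48 },
    throughput: 230, // requests per second
    errorRate: 0.002, // fraction of failed iterations
  },
  environment: { runtime: 'node-20' },
  timestamp: new Date(),
  duration: 4350, // total run time in milliseconds
  status: 'completed',
}
```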
Benchmark Types
Benchmarks are categorized into several types:
Model Benchmarks
Evaluate AI model performance:
- Accuracy
- Inference latency
- Token throughput
- Memory usage
Function Benchmarks
Measure function performance:
- Execution time
- Resource utilization
- Scalability
- Error rates
Workflow Benchmarks
Assess workflow efficiency:
- End-to-end latency
- Step timing
- Success rates
- Resource consumption
System Benchmarks
Evaluate overall system performance:
- API response times
- Concurrent request handling
- Database operations
- Queue processing
Working with Benchmarks
Running Benchmarks
Benchmarks can be executed on demand or on a schedule:
```typescript
// Example: Run a benchmark on a specific function
const result = await Benchmarks.run({
  type: 'function',
  target: {
    id: 'function-123',
    type: 'text-classification',
  },
  parameters: {
    iterations: 1000,
    concurrency: 10,
    inputSize: 'medium',
  },
})
```
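Assuming the call resolves to a Benchmark record as described above, key metrics can be inspected directly:

```typescript
// Inspect the results; latency is optional, so guard before reading it
if (result.status === 'completed' && result.metrics.latency) {
  console.log(`p95 latency: ${result.metrics.latency.p95} ms`)
}
console.log(`throughput: ${result.metrics.throughput ?? 'n/a'} req/s`)
console.log(`error rate: ${result.metrics.errorRate ?? 'n/a'}`)
```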
Comparing Benchmarks
Compare performance across different versions or implementations:
```typescript
// Example: Compare benchmark results
const comparison = await Benchmarks.compare({
  baseline: 'benchmark-123',
  current: 'benchmark-456',
  metrics: ['latency.p95', 'throughput', 'errorRate'],
})
```
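The return shape of `compare` isn't specified here; the sketch below assumes it resolves to a `Record<string, number>` of relative deltas versus the baseline, mirroring the `baseline.comparison` field in the Benchmark interface:

```typescript
// Assumption: each requested metric path maps to a relative delta,
// e.g. 0.08 means 8% higher than the baseline
for (const [metric, delta] of Object.entries(comparison)) {
  if (Math.abs(delta) > 0.05) {
    console.warn(`${metric} changed ${(delta * 100).toFixed(1)}% vs baseline`)
  }
}
```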
Tracking Performance Over Time
Monitor performance trends:
```typescript
// Example: Get historical performance data
const history = await Benchmarks.getHistory({
  target: {
    id: 'model-789',
    type: 'model',
  },
  metric: 'latency.p95',
  timeRange: {
    start: new Date('2023-01-01'),
    end: new Date('2023-12-31'),
  },
  interval: 'month',
})
```
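The shape of the returned history isn't documented here; as a sketch, assume an array of `{ timestamp, value }` data points:

```typescript
// Assumption: getHistory resolves to Array<{ timestamp: Date; value: number }>
for (const point of history) {
  console.log(`${point.timestamp.toISOString()}: p95 = ${point.value} ms`)
}

// Rough trend check across the requested range
const first = history[0]
const last = history[history.length - 1]
const change = ((last.value - first.value) / first.value) * 100
console.log(`p95 latency changed ${change.toFixed(1)}% over the year`)
```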
Best Practices
- Consistent Environment: Run benchmarks in consistent environments for valid comparisons (see the sketch after this list)
- Comprehensive Metrics: Measure multiple aspects of performance
- Statistical Significance: Ensure sufficient iterations for reliable results
- Realistic Workloads: Use representative data and workloads
- Regular Execution: Run benchmarks regularly to track performance over time
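As a sketch of the consistency and statistical-significance practices, a thin wrapper can pin the load profile and iteration count so runs remain comparable over time (the wrapper name and defaults are hypothetical):

```typescript
// Hypothetical helper: every run uses the same fixed load profile,
// so results are comparable across versions and over time
async function runStandardBenchmark(targetId: string, targetType: string) {
  return Benchmarks.run({
    type: 'function',
    target: { id: targetId, type: targetType },
    parameters: {
      iterations: 1000,    // enough samples for stable percentiles
      concurrency: 10,     // fixed load level
      inputSize: 'medium', // representative workload
    },
  })
}
```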
Integration with Other Observability Tools
Benchmarks integrate with other observability components:
- Metrics: Benchmark results feed into performance metrics
- Alerts: Performance regressions can trigger alerts
- Dashboards: Visualize benchmark trends and comparisons
- CI/CD: Integrate benchmarks into continuous integration pipelines, as sketched below
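For example, a CI step could benchmark the candidate build and fail the pipeline on a significant regression. The 10% threshold and the assumed delta-per-metric return shape of `compare` are illustrative:

```typescript
// Illustrative CI gate: fail the build if p95 latency regresses > 10%
const run = await Benchmarks.run({
  type: 'function',
  target: { id: 'function-123', type: 'text-classification' },
  parameters: { iterations: 1000, concurrency: 10 },
})

const diff = await Benchmarks.compare({
  baseline: 'benchmark-123', // last known-good run
  current: run.id,
  metrics: ['latency.p95'],
})

// Assumption: compare returns relative deltas keyed by metric path
if (diff['latency.p95'] > 0.1) {
  console.error('p95 latency regressed more than 10%; failing the build')
  process.exit(1)
}
```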
API Reference
List Benchmarks
GET /api/benchmarks
Query parameters:
- `type`: Filter by benchmark type
- `target.id`: Filter by target ID
- `from`: Start timestamp
- `to`: End timestamp
- `limit`: Maximum number of benchmarks to return
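For instance, listing recent function benchmarks over HTTP (the base URL is a placeholder for your deployment):

```typescript
// Placeholder host; substitute your deployment's base URL
const res = await fetch('https://example.com/api/benchmarks?type=function&limit=20')
const benchmarks = await res.json()
```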
Get Benchmark by ID
GET /api/benchmarks/:id
Run Benchmark
POST /api/benchmarks/run
Request body:
```json
{
  "type": "function",
  "target": {
    "id": "function-123",
    "type": "text-classification"
  },
  "parameters": {
    "iterations": 1000,
    "concurrency": 10
  }
}
```
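Over HTTP, a hypothetical client call might look like this; the host is a placeholder, and we assume the endpoint responds with the created benchmark record:

```typescript
const res = await fetch('https://example.com/api/benchmarks/run', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'function',
    target: { id: 'function-123', type: 'text-classification' },
    parameters: { iterations: 1000, concurrency: 10 },
  }),
})
const benchmark = await res.json() // assumed: the created Benchmark record
```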
Compare Benchmarks
POST /api/benchmarks/compare
Request body:
```json
{
  "baseline": "benchmark-123",
  "current": "benchmark-456",
  "metrics": ["latency.p95", "throughput"]
}
```