feat: Add OpenTelemetry metrics with Lambda Telemetry API integration #268

Open
mzajacsplunk wants to merge 2 commits into signalfx:main from mzajacsplunk:michal/add-otel-metrics

@mzajacsplunk
Contributor

Add comprehensive OpenTelemetry (OTel) metrics support alongside the existing
SignalFx metrics, providing a vendor-neutral OTLP implementation with 9 core
Lambda metrics plus 6 optional FaaS semantic convention metrics.

## Core Changes

**Main Integration** (cmd/splunk-extension-wrapper/splunk-extension-wrapper.go):
- Integrated OTel MeterProvider with USE_OTEL_METRICS feature flag
- Added Telemetry API subscriber initialization after extension registration
- Implemented graceful shutdown for both OTel and SignalFx systems
- Added stderr logging for CloudWatch visibility
- Automatic fallback to SignalFx on OTel initialization failure

**Extension API** (internal/extensionapi/extensionapi.go):
- Added ExtensionID() getter method for Telemetry API subscription

## New Packages

**internal/otelmetrics/** - OpenTelemetry metrics implementation:
- provider.go: MeterProvider setup with OTLP gRPC exporter
- instruments.go: 9 Lambda + 6 FaaS metric instruments
- metrics_sink.go: Telemetry event processor with state management
- sink.go: TelemetryMetricsSink interface definition

**internal/telemetry/** - Lambda Telemetry API integration:
- events.go: Telemetry event types and custom JSON unmarshaling
- subscriber.go: HTTP listener (0.0.0.0:4243) and event processor
- Handles platform.initStart/End, start, runtimeDone, report, shutdown

## Metrics Collected

**Core Lambda Metrics (9, always enabled)**:
1. lambda.function.invocation - Invocation count
2. lambda.function.initialization - Total initializations
3. lambda.function.initialization.latency - Init duration (ms)
4. lambda.function.cold_starts - On-demand cold starts
5. lambda.function.warm_starts - SnapStart warm starts
6. lambda.function.response_size - Response payload size (bytes)
7. lambda.function.snapstart.restore_duration - SnapStart restore time (ms)
8. lambda.function.shutdown - Shutdown count
9. lambda.function.lifetime - Environment lifetime (ms)
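
As an illustration of how the initialization counters above relate to each other, the sketch below classifies an init event by its `initializationType`. The `StartCounts` type is hypothetical and uses plain integers instead of OTel instruments; treating `provisioned-concurrency` as neither cold nor warm is an assumption, not something this PR states.

```go
package main

import "fmt"

// StartCounts is a hypothetical stand-in for the three init-related counters.
type StartCounts struct {
	Initializations int64 // lambda.function.initialization
	ColdStarts      int64 // lambda.function.cold_starts
	WarmStarts      int64 // lambda.function.warm_starts
}

// RecordInit counts every initialization and classifies it by the
// initializationType reported by the Telemetry API.
func (c *StartCounts) RecordInit(initializationType string) {
	c.Initializations++
	switch initializationType {
	case "on-demand":
		c.ColdStarts++
	case "snap-start":
		c.WarmStarts++
	default:
		// Assumption: other types (e.g. "provisioned-concurrency") only
		// contribute to the total.
	}
}

func main() {
	var c StartCounts
	for _, t := range []string{"on-demand", "snap-start", "provisioned-concurrency"} {
		c.RecordInit(t)
	}
	fmt.Printf("%+v\n", c)
}
```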

**FaaS Semantic Conventions (6, optional with OTEL_LAMBDA_EMIT_SEMCONV=true)**:
1. faas.invocations - Successful invocations
2. faas.errors - Failed invocations
3. faas.timeouts - Timed out invocations
4. faas.init_duration - Init duration histogram (seconds)
5. faas.duration - Invocation duration histogram (seconds)
6. faas.mem_usage - Memory usage histogram (bytes)
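
The FaaS metrics use seconds and bytes, while the Telemetry API reports milliseconds and megabytes, so some conversion is needed. The following is a rough sketch under stated assumptions: `FaasSample` and `FromReport` are hypothetical names, the status-to-counter mapping is inferred from the list above, and 1 MB is assumed to mean 1024 * 1024 bytes.

```go
package main

import "fmt"

// FaasSample is a hypothetical container for what one platform.report event
// would contribute to the optional faas.* instruments.
type FaasSample struct {
	Counter         string  // which faas.* counter to increment
	DurationSeconds float64 // value for the faas.duration histogram
	MemUsageBytes   int64   // value for the faas.mem_usage histogram
}

// FromReport maps a report status and its metrics to FaaS units.
// Assumption: any status other than "success" or "timeout" counts as an error.
func FromReport(status string, durationMs float64, maxMemoryUsedMB int64) FaasSample {
	counter := "faas.errors"
	switch status {
	case "success":
		counter = "faas.invocations"
	case "timeout":
		counter = "faas.timeouts"
	}
	return FaasSample{
		Counter:         counter,
		DurationSeconds: durationMs / 1000,
		MemUsageBytes:   maxMemoryUsedMB * 1024 * 1024,
	}
}

func main() {
	fmt.Printf("%+v\n", FromReport("success", 250, 64))
	fmt.Printf("%+v\n", FromReport("timeout", 3000, 128))
}
```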

## Configuration

**Required**:
- USE_OTEL_METRICS=true - Enable OpenTelemetry metrics

**Optional**:
- OTEL_EXPORTER_OTLP_ENDPOINT=collector:4317 (default: 127.0.0.1:4317)
- OTEL_EXPORTER_OTLP_INSECURE=true (default: false)
- OTEL_LAMBDA_EMIT_SEMCONV=true (default: false)
- OTEL_EXPORTER_OTLP_HEADERS - Custom OTLP headers
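
A minimal sketch of how these variables and their defaults could be read is shown below. `Config` and `LoadConfig` are hypothetical names, and accepting only a case-insensitive `true` for the boolean flags is an assumption. The header format (`key=value` pairs separated by commas) follows the OTLP exporter convention.

```go
package main

import (
	"fmt"
	"strings"
)

// Config is a hypothetical view of the settings listed above.
type Config struct {
	Enabled     bool
	Endpoint    string
	Insecure    bool
	EmitSemconv bool
	Headers     map[string]string
}

// isTrue treats only "true" (any case) as enabled; this is an assumption.
func isTrue(v string) bool {
	return strings.EqualFold(strings.TrimSpace(v), "true")
}

// LoadConfig reads the variables through getenv so it can be tested without
// touching the process environment.
func LoadConfig(getenv func(string) string) Config {
	cfg := Config{
		Enabled:     isTrue(getenv("USE_OTEL_METRICS")),
		Endpoint:    strings.TrimSpace(getenv("OTEL_EXPORTER_OTLP_ENDPOINT")),
		Insecure:    isTrue(getenv("OTEL_EXPORTER_OTLP_INSECURE")),
		EmitSemconv: isTrue(getenv("OTEL_LAMBDA_EMIT_SEMCONV")),
		Headers:     map[string]string{},
	}
	if cfg.Endpoint == "" {
		cfg.Endpoint = "127.0.0.1:4317" // documented default
	}
	// Headers arrive as comma-separated key=value pairs.
	for _, pair := range strings.Split(getenv("OTEL_EXPORTER_OTLP_HEADERS"), ",") {
		key, value, ok := strings.Cut(pair, "=")
		if key = strings.TrimSpace(key); ok && key != "" {
			cfg.Headers[key] = strings.TrimSpace(value)
		}
	}
	return cfg
}

func main() {
	env := map[string]string{
		"USE_OTEL_METRICS":           "true",
		"OTEL_EXPORTER_OTLP_HEADERS": "x-example-token=abc, x-team=obs",
	}
	cfg := LoadConfig(func(k string) string { return env[k] })
	fmt.Printf("%+v\n", cfg)
}
```

In the extension itself, `os.Getenv` could be passed as the `getenv` argument.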

## Testing

**Unit Tests** (42+ tests, all passing):
- internal/otelmetrics/provider_test.go - Provider setup and resource building
- internal/otelmetrics/metrics_sink_test.go - MetricsSink lifecycle
- internal/otelmetrics/gauge_emulation_test.go - UpDownCounter delta tracking
- internal/otelmetrics/semconv_toggle_test.go - FaaS metrics toggle
- internal/telemetry/events_test.go - JSON parsing (object & escaped string)
- internal/telemetry/subscriber_test.go - Event processing and HTTP errors

**Integration Tests** (2 tests, passing):
- test/integration/integration_test.go - Full lifecycle with OTel Collector
- Validates OTLP export, metric names, and telemetry subscriber

## Documentation

**Created**:
- docs/OTEL_QUICK_START.md - Quick start guide with examples
- internal/otelmetrics/README.md - Package documentation
- internal/telemetry/README.md - Telemetry API integration guide
- test/integration/README.md - Integration test instructions

**Cleaned Up** (deleted redundant docs):
- Removed TEST_SUMMARY.md (317 lines, info in test files)
- Removed internal/otelmetrics/METRICS_SINK.md (171 lines, code comments sufficient)
- Removed test/integration/INTEGRATION_TEST_SETUP.md (duplicate content)
- Removed docs/OTEL_METRICS_MIGRATION.md (380 lines, info in QUICK_START)

Result: 33% reduction in documentation files (15 → 10) while preserving all essential information

## Key Features

**Telemetry API Integration**:
- Listens on 0.0.0.0:4243 (subscription uses sandbox.localdomain)
- Subscribes to platform events only (JSON guaranteed)
- Buffering: item limit within the 1000-10000 range that AWS allows, 500ms timeout
- Extracts producedBytes, restoreDurationMs, initializationType

**State Management**:
- Tracks init/invoke timestamps for duration calculation
- Delta-based lifetime tracking (UpDownCounter gauge emulation)
- Cold vs warm start detection (on-demand vs snap-start)
- Request-scoped invocation tracking
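
The delta-based gauge emulation can be sketched as follows: an UpDownCounter only accepts increments, so the sink remembers the last value it reported and adds the difference. `Adder` and `GaugeEmulator` are hypothetical names; in the real package the adder would presumably be an OTel UpDownCounter.

```go
package main

import "fmt"

// Adder is the minimal behavior needed from an UpDownCounter-like instrument.
type Adder interface {
	Add(delta int64)
}

// GaugeEmulator makes an additive instrument behave like a gauge by
// reporting only the change since the previous observation.
type GaugeEmulator struct {
	adder Adder
	last  int64
}

// Set records value by adding (value - last) to the underlying instrument.
func (g *GaugeEmulator) Set(value int64) {
	g.adder.Add(value - g.last)
	g.last = value
}

// sumAdder is a test double that accumulates deltas.
type sumAdder struct {
	total  int64
	deltas []int64
}

func (s *sumAdder) Add(delta int64) {
	s.total += delta
	s.deltas = append(s.deltas, delta)
}

func main() {
	counter := &sumAdder{}
	lifetime := &GaugeEmulator{adder: counter}
	// Environment lifetime in ms, observed at three points in time.
	for _, ms := range []int64{100, 250, 400} {
		lifetime.Set(ms)
	}
	fmt.Println(counter.deltas, counter.total)
}
```

The accumulated total always equals the most recent value passed to `Set`, which is what makes the counter read like a gauge on the backend.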

**Error Handling**:
- Graceful handling of missing/malformed telemetry data
- Automatic fallback to SignalFx if OTel fails
- Context-based shutdown with 5s timeout
- Custom JSON unmarshaling for AWS's escaped JSON strings

## Architecture

```
Lambda Extension
    ↓ (if USE_OTEL_METRICS=true)
Telemetry API (sandbox.localdomain:4243)
    ↓ platform events (JSON)
TelemetrySubscriber
    ↓ parsed events
MetricsSink
    ↓ OTel metrics
MeterProvider (5s export interval)
    ↓ OTLP gRPC
Any OTLP Collector
```

## Backward Compatibility

**No Breaking Changes**:
- SignalFx metrics remain default behavior
- OTel is opt-in via USE_OTEL_METRICS=true
- Existing deployments work unchanged
- Automatic fallback if OTel initialization fails

## Build & Deploy

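```bash
# Build extension layer
./build-layer.sh

# Upload to AWS Lambda
aws lambda publish-layer-version \
  --layer-name splunk-otel-extension \
  --zip-file fileb://bin/extension.zip

# Enable on function
# Note: --environment replaces the function's entire variable set, so include
# any existing variables as well. The published layer version must also be
# attached to the function (for example with --layers) for the extension to run.
aws lambda update-function-configuration \
  --function-name YOUR_FUNCTION \
  --environment Variables="{USE_OTEL_METRICS=true,OTEL_EXPORTER_OTLP_ENDPOINT=collector:4317}"
```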

**All tests passing** ✅
- Unit tests: 42+ tests across 6 test files
- Integration tests: 2 tests with live OTel Collector
- Build: Successful on macOS and Linux