feat: Add OpenTelemetry metrics with Lambda Telemetry API integration #268
Open
mzajacsplunk wants to merge 2 commits into signalfx:main
Add OpenTelemetry (OTel) metrics support alongside the existing SignalFx
metrics, providing a vendor-neutral OTLP implementation with 9 core
Lambda metrics plus 6 optional FaaS semantic convention metrics.
## Core Changes
**Main Integration** (cmd/splunk-extension-wrapper/splunk-extension-wrapper.go):
- Integrated OTel MeterProvider with USE_OTEL_METRICS feature flag
- Added Telemetry API subscriber initialization after extension registration
- Implemented graceful shutdown for both OTel and SignalFx systems
- Added stderr logging for CloudWatch visibility
- Automatic fallback to SignalFx on OTel initialization failure
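
The flag check and fallback described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `chooseBackend`, the `initOTel` callback, and the exact comparison against `"true"` are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errInit is a sample failure used by the demo in main.
var errInit = errors.New("collector unreachable")

// chooseBackend reports which metrics backend should run.
// initOTel stands in for the real MeterProvider setup (hypothetical signature).
func chooseBackend(flag string, initOTel func() error) string {
	if flag != "true" {
		return "signalfx" // default: existing behavior is unchanged
	}
	if err := initOTel(); err != nil {
		// Log to stderr so the failure is visible in CloudWatch.
		fmt.Fprintln(os.Stderr, "otel init failed, falling back to SignalFx:", err)
		return "signalfx"
	}
	return "otel"
}

func main() {
	ok := func() error { return nil }
	failing := func() error { return errInit }

	fmt.Println(chooseBackend("", ok))          // flag unset
	fmt.Println(chooseBackend("true", ok))      // flag set, init succeeds
	fmt.Println(chooseBackend("true", failing)) // flag set, init fails
}
```

Passing the initializer as a function keeps the decision logic testable without a running collector.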
**Extension API** (internal/extensionapi/extensionapi.go):
- Added ExtensionID() getter method for Telemetry API subscription
## New Packages
**internal/otelmetrics/** - OpenTelemetry metrics implementation:
- provider.go: MeterProvider setup with OTLP gRPC exporter
- instruments.go: 9 Lambda + 6 FaaS metric instruments
- metrics_sink.go: Telemetry event processor with state management
- sink.go: TelemetryMetricsSink interface definition
**internal/telemetry/** - Lambda Telemetry API integration:
- events.go: Telemetry event types and custom JSON unmarshaling
- subscriber.go: HTTP listener (0.0.0.0:4243) and event processor
- Handles platform.initStart/End, start, runtimeDone, report, shutdown
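
One way to accept a `record` that arrives either as a JSON object or as a string of escaped JSON is a custom `UnmarshalJSON`. The sketch below assumes simplified `Event` and `Record` types and a `parseEvent` helper; none of these are taken from `events.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a simplified telemetry event (illustrative, not the PR's type).
type Event struct {
	Time   string `json:"time"`
	Type   string `json:"type"`
	Record Record `json:"record"`
}

// Record holds the decoded payload as a generic map.
type Record map[string]any

// UnmarshalJSON accepts a JSON object or a string containing escaped JSON.
func (r *Record) UnmarshalJSON(data []byte) error {
	if string(data) == "null" {
		return nil // leave the record empty
	}
	var raw string
	if err := json.Unmarshal(data, &raw); err == nil {
		data = []byte(raw) // the payload was a string: decode its contents
	}
	var m map[string]any
	if err := json.Unmarshal(data, &m); err != nil {
		return fmt.Errorf("decode record: %w", err)
	}
	*r = m
	return nil
}

// parseEvent decodes a single event from raw JSON.
func parseEvent(b []byte) (Event, error) {
	var e Event
	err := json.Unmarshal(b, &e)
	return e, err
}

func main() {
	inputs := []string{
		`{"type":"platform.start","record":{"requestId":"abc"}}`,
		`{"type":"platform.report","record":"{\"requestId\":\"xyz\"}"}`,
	}
	for _, in := range inputs {
		e, err := parseEvent([]byte(in))
		fmt.Println(e.Type, e.Record["requestId"], err)
	}
}
```

A record string that is not valid JSON is reported as an error rather than silently dropped; whether the PR treats that case the same way is not stated here.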
## Metrics Collected
**Core Lambda Metrics (9, always enabled)**:
1. lambda.function.invocation - Invocation count
2. lambda.function.initialization - Total initializations
3. lambda.function.initialization.latency - Init duration (ms)
4. lambda.function.cold_starts - On-demand cold starts
5. lambda.function.warm_starts - SnapStart warm starts
6. lambda.function.response_size - Response payload size (bytes)
7. lambda.function.snapstart.restore_duration - SnapStart restore time (ms)
8. lambda.function.shutdown - Shutdown count
9. lambda.function.lifetime - Environment lifetime (ms)
**FaaS Semantic Conventions (6, optional with OTEL_LAMBDA_EMIT_SEMCONV=true)**:
1. faas.invocations - Successful invocations
2. faas.errors - Failed invocations
3. faas.timeouts - Timed out invocations
4. faas.init_duration - Init duration histogram (seconds)
5. faas.duration - Invocation duration histogram (seconds)
6. faas.mem_usage - Memory usage histogram (bytes)
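
As a rough illustration of how the optional FaaS metrics relate to the platform events, the sketch below maps an invocation status to one of the three counters and converts millisecond durations to the seconds used by the histograms. The status strings and both function names are assumptions for this example and may not match `metrics_sink.go`.

```go
package main

import "fmt"

// semconvCounter returns the FaaS counter name for an invocation status.
// The status values are assumed to be those reported by platform.runtimeDone.
func semconvCounter(status string) string {
	switch status {
	case "success":
		return "faas.invocations"
	case "timeout":
		return "faas.timeouts"
	case "error", "failure":
		return "faas.errors"
	default:
		return "" // unknown status: record nothing
	}
}

// msToSeconds converts a millisecond duration to seconds for the
// faas.init_duration and faas.duration histograms.
func msToSeconds(ms float64) float64 {
	return ms / 1000
}

func main() {
	for _, s := range []string{"success", "timeout", "error"} {
		fmt.Println(s, "->", semconvCounter(s))
	}
	fmt.Println(msToSeconds(1500), "seconds")
}
```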
## Configuration
**Required**:
- USE_OTEL_METRICS=true - Enable OpenTelemetry metrics
**Optional**:
- OTEL_EXPORTER_OTLP_ENDPOINT=collector:4317 (default: 127.0.0.1:4317)
- OTEL_EXPORTER_OTLP_INSECURE=true (default: false)
- OTEL_LAMBDA_EMIT_SEMCONV=true (default: false)
- OTEL_EXPORTER_OTLP_HEADERS - Custom OTLP headers
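
A minimal sketch of reading these variables with the defaults listed above. The `Config` struct and `loadConfig` are hypothetical names, and the comma-separated `key=value` header format follows the general OTLP exporter convention rather than anything confirmed in this PR.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config mirrors the environment variables listed above (illustrative type).
type Config struct {
	Enabled     bool
	Endpoint    string
	Insecure    bool
	EmitSemconv bool
	Headers     map[string]string
}

// loadConfig reads settings through getenv so it can be tested without
// touching the process environment.
func loadConfig(getenv func(string) string) Config {
	cfg := Config{
		Enabled:     getenv("USE_OTEL_METRICS") == "true",
		Endpoint:    "127.0.0.1:4317", // documented default
		Insecure:    getenv("OTEL_EXPORTER_OTLP_INSECURE") == "true",
		EmitSemconv: getenv("OTEL_LAMBDA_EMIT_SEMCONV") == "true",
		Headers:     map[string]string{},
	}
	if v := getenv("OTEL_EXPORTER_OTLP_ENDPOINT"); v != "" {
		cfg.Endpoint = v
	}
	// Assumed format: key1=value1,key2=value2
	for _, pair := range strings.Split(getenv("OTEL_EXPORTER_OTLP_HEADERS"), ",") {
		k, v, ok := strings.Cut(pair, "=")
		if ok && strings.TrimSpace(k) != "" {
			cfg.Headers[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return cfg
}

func main() {
	cfg := loadConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```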
## Testing
**Unit Tests** (42+ tests, all passing):
- internal/otelmetrics/provider_test.go - Provider setup and resource building
- internal/otelmetrics/metrics_sink_test.go - MetricsSink lifecycle
- internal/otelmetrics/gauge_emulation_test.go - UpDownCounter delta tracking
- internal/otelmetrics/semconv_toggle_test.go - FaaS metrics toggle
- internal/telemetry/events_test.go - JSON parsing (object & escaped string)
- internal/telemetry/subscriber_test.go - Event processing and HTTP errors
**Integration Tests** (2 tests, passing):
- test/integration/integration_test.go - Full lifecycle with OTel Collector
- Validates OTLP export, metric names, and telemetry subscriber
## Documentation
**Created**:
- docs/OTEL_QUICK_START.md - Quick start guide with examples
- internal/otelmetrics/README.md - Package documentation
- internal/telemetry/README.md - Telemetry API integration guide
- test/integration/README.md - Integration test instructions
**Cleaned Up** (deleted redundant docs):
- Removed TEST_SUMMARY.md (317 lines, info in test files)
- Removed internal/otelmetrics/METRICS_SINK.md (171 lines, code comments sufficient)
- Removed test/integration/INTEGRATION_TEST_SETUP.md (duplicate content)
- Removed docs/OTEL_METRICS_MIGRATION.md (380 lines, info in QUICK_START)
Result: 33% reduction in documentation files (15 → 10) while preserving all essential information
## Key Features
**Telemetry API Integration**:
- Listens on 0.0.0.0:4243 (subscription uses sandbox.localdomain)
- Subscribes to platform events only (JSON guaranteed)
- Buffering: 1000-10000 items, 500ms timeout (per AWS Telemetry API buffering limits)
- Extracts producedBytes, restoreDurationMs, initializationType
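
For orientation, a subscription request body matching the settings above might be built like this. The field names and the `2022-12-13` schema version reflect my reading of the AWS Telemetry API documentation and should be checked against it; `newSubscription`, `subscriptionJSON`, and the choice of 1000 for `maxItems` are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// subscription is the assumed shape of a Telemetry API subscribe request.
type subscription struct {
	SchemaVersion string      `json:"schemaVersion"`
	Types         []string    `json:"types"`
	Buffering     buffering   `json:"buffering"`
	Destination   destination `json:"destination"`
}

type buffering struct {
	MaxItems  int `json:"maxItems"`
	TimeoutMs int `json:"timeoutMs"`
}

type destination struct {
	Protocol string `json:"protocol"`
	URI      string `json:"URI"`
}

// newSubscription requests platform events only, delivered over HTTP to the
// extension's listener on the given port.
func newSubscription(port int) subscription {
	return subscription{
		SchemaVersion: "2022-12-13", // assumption: a schema version known to the API
		Types:         []string{"platform"},
		Buffering:     buffering{MaxItems: 1000, TimeoutMs: 500},
		Destination: destination{
			Protocol: "HTTP",
			URI:      fmt.Sprintf("http://sandbox.localdomain:%d", port),
		},
	}
}

// subscriptionJSON renders the request body.
func subscriptionJSON(port int) (string, error) {
	b, err := json.Marshal(newSubscription(port))
	return string(b), err
}

func main() {
	body, err := subscriptionJSON(4243)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(body)
}
```

The listener binds to 0.0.0.0:4243 while the destination URI names `sandbox.localdomain`, matching the split described in the list above.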
**State Management**:
- Tracks init/invoke timestamps for duration calculation
- Delta-based lifetime tracking (UpDownCounter gauge emulation)
- Cold vs warm start detection (on-demand vs snap-start)
- Request-scoped invocation tracking
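
The delta-based gauge emulation and the cold/warm classification can be illustrated with a small sketch. The `gaugeEmulator` type, its `add` callback (standing in for an UpDownCounter's `Add`), and `startKind` are hypothetical; other `initializationType` values are simply reported as unknown here.

```go
package main

import "fmt"

// startKind classifies an initialization by its initializationType value.
func startKind(initializationType string) string {
	switch initializationType {
	case "on-demand":
		return "cold"
	case "snap-start":
		return "warm"
	default:
		return "unknown"
	}
}

// gaugeEmulator turns absolute readings into UpDownCounter deltas so the
// counter's running sum always equals the latest reading.
type gaugeEmulator struct {
	last int64
	add  func(delta int64) // stands in for an UpDownCounter's Add
}

// Set records a new absolute value by emitting only the difference.
func (g *gaugeEmulator) Set(value int64) {
	delta := value - g.last
	if delta == 0 {
		return // nothing changed, nothing to export
	}
	g.add(delta)
	g.last = value
}

func main() {
	var total int64
	g := &gaugeEmulator{add: func(d int64) { total += d }}

	for _, lifetimeMs := range []int64{100, 250, 250, 40} {
		g.Set(lifetimeMs)
		fmt.Println("reading", lifetimeMs, "counter sum", total)
	}
	fmt.Println(startKind("on-demand"), startKind("snap-start"))
}
```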
**Error Handling**:
- Graceful handling of missing/malformed telemetry data
- Automatic fallback to SignalFx if OTel fails
- Context-based shutdown with 5s timeout
- Custom JSON unmarshaling for AWS's escaped JSON strings
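
A context-bounded shutdown along the lines described above might look like the following sketch. `shutdownAll`, `shutdownFunc`, and `demoShutdown` are illustrative names; the real code presumably passes the OTel and SignalFx shutdown routines with a 5s timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownFunc is any component shutdown that honors context cancellation.
type shutdownFunc func(ctx context.Context) error

// shutdownAll runs each shutdown in order under one shared deadline and
// returns the combined errors, or nil if all succeeded.
func shutdownAll(timeout time.Duration, fns ...shutdownFunc) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var errs []error
	for _, fn := range fns {
		if err := fn(ctx); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// demoShutdown simulates one component that needs workMs to flush.
func demoShutdown(timeoutMs, workMs int) error {
	work := func(ctx context.Context) error {
		select {
		case <-time.After(time.Duration(workMs) * time.Millisecond):
			return nil // flushed in time
		case <-ctx.Done():
			return ctx.Err() // deadline hit before the flush finished
		}
	}
	return shutdownAll(time.Duration(timeoutMs)*time.Millisecond, work)
}

func main() {
	fmt.Println("fast flush:", demoShutdown(5000, 10))
	fmt.Println("slow flush:", demoShutdown(20, 1000))
}
```

Sharing one deadline across all components keeps the total shutdown time bounded, which matters because Lambda gives extensions a limited shutdown window.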
## Architecture
```
Lambda Extension
↓ (if USE_OTEL_METRICS=true)
Telemetry API (sandbox.localdomain:4243)
↓ platform events (JSON)
TelemetrySubscriber
↓ parsed events
MetricsSink
↓ OTel metrics
MeterProvider (5s export interval)
↓ OTLP gRPC
Any OTLP Collector
```
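
The hand-off from subscriber to sink in the diagram can be pictured as a dispatch on the event type. In this sketch the interface name is borrowed from `sink.go`, but its methods, the `Event` type, and `dispatch` are hypothetical and only cover three of the handled event types.

```go
package main

import "fmt"

// TelemetryMetricsSink is named after the PR's interface; these methods are
// hypothetical and chosen only for this example.
type TelemetryMetricsSink interface {
	OnInitStart(initializationType string)
	OnInvokeStart(requestID string)
	OnRuntimeDone(requestID, status string)
}

// Event is a parsed platform event (simplified).
type Event struct {
	Type   string
	Record map[string]any
}

// field returns a string field from the record, or "" if absent or not a string.
func field(rec map[string]any, key string) string {
	s, _ := rec[key].(string)
	return s
}

// dispatch forwards an event to the sink and reports whether it was handled.
func dispatch(sink TelemetryMetricsSink, ev Event) bool {
	switch ev.Type {
	case "platform.initStart":
		sink.OnInitStart(field(ev.Record, "initializationType"))
	case "platform.start":
		sink.OnInvokeStart(field(ev.Record, "requestId"))
	case "platform.runtimeDone":
		sink.OnRuntimeDone(field(ev.Record, "requestId"), field(ev.Record, "status"))
	default:
		return false // event type not covered by this sketch
	}
	return true
}

// recordingSink logs calls so the flow can be inspected.
type recordingSink struct{ calls []string }

func (r *recordingSink) OnInitStart(t string)   { r.calls = append(r.calls, "init:"+t) }
func (r *recordingSink) OnInvokeStart(id string) { r.calls = append(r.calls, "start:"+id) }
func (r *recordingSink) OnRuntimeDone(id, status string) {
	r.calls = append(r.calls, "done:"+id+":"+status)
}

func main() {
	sink := &recordingSink{}
	events := []Event{
		{Type: "platform.initStart", Record: map[string]any{"initializationType": "on-demand"}},
		{Type: "platform.start", Record: map[string]any{"requestId": "r1"}},
		{Type: "platform.runtimeDone", Record: map[string]any{"requestId": "r1", "status": "success"}},
	}
	for _, ev := range events {
		dispatch(sink, ev)
	}
	fmt.Println(sink.calls)
}
```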
## Backward Compatibility
**No Breaking Changes**:
- SignalFx metrics remain default behavior
- OTel is opt-in via USE_OTEL_METRICS=true
- Existing deployments work unchanged
- Automatic fallback if OTel initialization fails
## Build & Deploy
```bash
# Build extension layer
./build-layer.sh

# Upload to AWS Lambda
aws lambda publish-layer-version \
  --layer-name splunk-otel-extension \
  --zip-file fileb://bin/extension.zip

# Enable on function
aws lambda update-function-configuration \
  --function-name YOUR_FUNCTION \
  --environment Variables="{USE_OTEL_METRICS=true,OTEL_EXPORTER_OTLP_ENDPOINT=collector:4317}"
```
**All tests passing** ✅
- Unit tests: 42+ tests across 6 test files
- Integration tests: 2 tests with live OTel Collector
- Build: Successful on macOS and Linux