
feat(bedrock-agentcore-alpha): add OnlineEvaluationConfig and Evaluator L2 constructs #37615

Open
rezabekf wants to merge 29 commits into aws:main from rezabekf:rezabekf/agentcore-eval-construct

@rezabekf

Issue # (if applicable)

Closes #37614.

Reason for this change

Amazon Bedrock AgentCore Online Evaluation enables continuous monitoring and assessment of agent performance using live traffic. This PR adds L2 constructs for the evaluation module to the @aws-cdk/aws-bedrock-agentcore-alpha package.

CDK users can now:

  • Configure continuous evaluation of agent traces using built-in and custom evaluators
  • Control evaluation execution status (ENABLED/DISABLED) via executionStatus prop
  • Sample and filter traces for targeted evaluation
  • Integrate seamlessly with AgentCore Runtime constructs via DataSourceConfig.fromAgentRuntimeEndpoint()

Description of changes

OnlineEvaluationConfig — L2 construct backed by CfnOnlineEvaluationConfig

  • Auto-creates IAM execution role with required permissions (CloudWatch Logs read/write, Bedrock model invocation, index policies)
  • Supports executionStatus prop (ExecutionStatus.ENABLED / ExecutionStatus.DISABLED) to control whether evaluation actively processes traces
  • Accepts a mix of built-in and custom evaluators via EvaluatorReference
  • Provides fromOnlineEvaluationConfigId(), fromOnlineEvaluationConfigArn(), and fromOnlineEvaluationConfigAttributes() import methods
  • Implements IGrantable for IAM permission grants and ITaggableV2 for CDK tag propagation
  • Validates config name, description, evaluator count, sampling percentage, filter count, and session timeout
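
The validation behaviour above, together with the token passthrough covered by the unit tests, can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the length limit, name pattern, and evaluator-count bounds are assumptions, and `isUnresolved` is a dependency-free stand-in for `Token.isUnresolved()` from aws-cdk-lib.

```typescript
// Stand-in for Token.isUnresolved() from aws-cdk-lib, kept local so the sketch
// has no dependencies. Assumption: string tokens contain the "${Token[" marker.
function isUnresolved(value: string): boolean {
  return value.includes("${Token[");
}

// Hypothetical limits, chosen only for illustration (not taken from the PR).
const MAX_NAME_LENGTH = 48;
const NAME_PATTERN = /^[a-zA-Z][a-zA-Z0-9_]*$/;
const MIN_EVALUATORS = 1;
const MAX_EVALUATORS = 10;

// Returns a list of error messages; an empty list means the value is accepted.
function validateConfigName(name: string): string[] {
  // Deploy-time values cannot be checked at synth time, so they pass through.
  if (isUnresolved(name)) {
    return [];
  }
  const errors: string[] = [];
  if (name.length === 0) {
    errors.push("Config name must not be empty");
    return errors;
  }
  if (name.length > MAX_NAME_LENGTH) {
    errors.push(`Config name must be at most ${MAX_NAME_LENGTH} characters`);
  }
  if (!NAME_PATTERN.test(name)) {
    errors.push("Config name must start with a letter and use only letters, digits, and underscores");
  }
  return errors;
}

// Array lengths are always known at synth time, so no token check is needed.
function validateEvaluatorCount(count: number): string[] {
  if (count < MIN_EVALUATORS || count > MAX_EVALUATORS) {
    return [`Expected between ${MIN_EVALUATORS} and ${MAX_EVALUATORS} evaluators, got ${count}`];
  }
  return [];
}
```

Collecting messages instead of throwing on the first failure lets a construct report every problem in a single error, which is one common choice; the PR does not say which approach it takes.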

EvaluatorReference — Unified entry point for referencing evaluators

  • EvaluatorReference.builtin() — References one of the 13 pre-defined evaluators (e.g., HELPFULNESS, CORRECTNESS)
  • EvaluatorReference.custom() — References a user-created Evaluator construct

Evaluator — L2 construct backed by CfnEvaluator for custom evaluation logic

  • EvaluatorConfig.llmAsAJudge() — Foundation model-based evaluation with custom instructions and rating scales (categorical or numerical)
  • EvaluatorConfig.codeBased() — Lambda function-based evaluation; automatically grants scoped lambda:InvokeFunction permission with aws:SourceAccount and
    aws:SourceArn conditions (confused deputy prevention)
  • Provides fromEvaluatorId(), fromEvaluatorArn(), and fromEvaluatorAttributes() import methods
  • Validates evaluator name, description, rating scale options, and instructions

EvaluatorRatingScale — Factory class for custom evaluator rating scales

  • EvaluatorRatingScale.categorical() — Discrete label-based scoring (e.g., Good/Bad)
  • EvaluatorRatingScale.numerical() — Labeled numeric scoring (e.g., 1-5)

DataSourceConfig — Configuration for evaluation data sources

  • fromCloudWatchLogs() — For external agents or custom log groups
  • fromAgentRuntimeEndpoint() — Seamless integration with AgentCore Runtime (derives log group and service names automatically)

Design decisions:

  • Factory classes (EvaluatorConfig, EvaluatorRatingScale, DataSourceConfig) used instead of union types for jsii compatibility
  • modelId accepted as plain string — supports standard model IDs and cross-region inference profile IDs (e.g., us.anthropic.claude-sonnet-4-6)
  • Instructions placeholder validation delegated to the service — placeholders vary by evaluation level and may change
  • Follows existing agentcore patterns (Runtime, Memory, Gateway) with interface + base class + concrete class
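
To illustrate the first design decision, here is a minimal sketch of the factory-class pattern: static factory methods on an abstract class stand in for a union type, and `bind()` returns a named result interface. The class name `RatingScaleSketch`, the option shapes, and the `bind()` result fields are hypothetical and are not taken from the PR's source.

```typescript
// Hypothetical option shapes, for illustration only.
interface CategoricalOption {
  readonly label: string;
  readonly definition: string;
}

interface NumericalOption {
  readonly label: string;
  readonly value: number;
  readonly definition: string;
}

// A named result type: exactly one of the two fields is set by each variant.
interface RatingScaleBindResult {
  readonly categorical?: CategoricalOption[];
  readonly numerical?: NumericalOption[];
}

// The abstract class plays the role a union type would play in plain TypeScript.
abstract class RatingScaleSketch {
  static categorical(options: CategoricalOption[]): RatingScaleSketch {
    return new CategoricalScaleSketch(options);
  }

  static numerical(options: NumericalOption[]): RatingScaleSketch {
    return new NumericalScaleSketch(options);
  }

  // Each variant renders itself into the shared, named result shape.
  abstract bind(): RatingScaleBindResult;
}

class CategoricalScaleSketch extends RatingScaleSketch {
  constructor(private readonly options: CategoricalOption[]) {
    super();
  }

  bind(): RatingScaleBindResult {
    return { categorical: [...this.options] };
  }
}

class NumericalScaleSketch extends RatingScaleSketch {
  constructor(private readonly options: NumericalOption[]) {
    super();
  }

  bind(): RatingScaleBindResult {
    return { numerical: [...this.options] };
  }
}
```

Callers only ever see the abstract class and its static methods, so new variants can be added later without changing the type that construct props accept.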

Describe any new or updated permissions being added

The auto-created OnlineEvaluationConfig execution role includes:

  • CloudWatch Logs Describe (logs:DescribeLogGroups) — unscoped (*), as this action does not support resource-level restrictions
  • CloudWatch Logs Query (logs:GetQueryResults, logs:StartQuery) — scoped to user-specified log groups and the aws/spans log group
  • CloudWatch Logs Write (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents) — scoped to
    arn:aws:logs:*:*:log-group:/aws/bedrock-agentcore/evaluations/*
  • CloudWatch Index Policy (logs:DescribeIndexPolicies, logs:PutIndexPolicy) — scoped to aws/spans log group
  • Bedrock Model Invocation (bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream) — for LLM-as-a-Judge evaluators
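
As a rough illustration of the CloudWatch Logs read scoping listed above, the sketch below builds plain IAM statement objects: `logs:DescribeLogGroups` on `*`, and the query actions on specific log group ARNs. The function names, the `:*` ARN suffix, and the sample values are assumptions made for illustration; per the commit history, the PR itself derives these ARNs with `Arn.format()` and pseudo parameters.

```typescript
interface Statement {
  readonly Effect: "Allow";
  readonly Action: string[];
  readonly Resource: string[];
}

interface Env {
  readonly partition: string;
  readonly region: string;
  readonly account: string;
}

// Assumption: log group ARNs carry the trailing ":*" commonly used in IAM policies.
function logGroupArn(env: Env, logGroupName: string): string {
  return `arn:${env.partition}:logs:${env.region}:${env.account}:log-group:${logGroupName}:*`;
}

// Builds the two read statements: one unscoped, one scoped to named log groups.
function logsReadStatements(env: Env, logGroupNames: string[]): Statement[] {
  // The user-specified log groups plus the aws/spans log group named in the PR.
  const scoped = [...logGroupNames, "aws/spans"].map((name) => logGroupArn(env, name));
  return [
    // DescribeLogGroups does not support resource-level restrictions.
    { Effect: "Allow", Action: ["logs:DescribeLogGroups"], Resource: ["*"] },
    // Query actions are limited to the log groups the evaluation reads from.
    { Effect: "Allow", Action: ["logs:StartQuery", "logs:GetQueryResults"], Resource: scoped },
  ];
}
```

Passing the partition in explicitly mirrors the cross-partition concern mentioned in the commit messages; a hard-coded `arn:aws:` prefix would not work outside the standard partition.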

Code-based Evaluator construct:

  • Lambda Invoke (lambda:InvokeFunction) — granted to bedrock-agentcore.amazonaws.com service principal, scoped with aws:SourceAccount and
    aws:SourceArn conditions to the specific evaluator resource
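
The confused deputy protection described above corresponds to a Lambda resource policy statement of roughly the following shape. This is a sketch under assumptions: the helper name and the sample ARNs (including the evaluator ARN format) are hypothetical, and the exact condition operators the construct emits are not stated in the PR; `StringEquals` and `ArnLike` are the operators commonly used for source account and source ARN conditions.

```typescript
interface InvokePermissionStatement {
  readonly Effect: "Allow";
  readonly Principal: { readonly Service: string };
  readonly Action: "lambda:InvokeFunction";
  readonly Resource: string;
  readonly Condition: {
    readonly StringEquals: { readonly "aws:SourceAccount": string };
    readonly ArnLike: { readonly "aws:SourceArn": string };
  };
}

// Hypothetical helper: builds the statement that lets the service invoke one
// function, but only on behalf of one evaluator in one account.
function evaluatorInvokeStatement(
  functionArn: string,
  accountId: string,
  evaluatorArn: string,
): InvokePermissionStatement {
  return {
    Effect: "Allow",
    Principal: { Service: "bedrock-agentcore.amazonaws.com" },
    Action: "lambda:InvokeFunction",
    Resource: functionArn,
    Condition: {
      // Rejects calls the service makes on behalf of other accounts.
      StringEquals: { "aws:SourceAccount": accountId },
      // Rejects calls that originate from any other evaluator resource.
      ArnLike: { "aws:SourceArn": evaluatorArn },
    },
  };
}

// Placeholder values; the evaluator ARN format here is illustrative only.
const exampleStatement = evaluatorInvokeStatement(
  "arn:aws:lambda:us-east-1:111122223333:function:my-evaluator-fn",
  "111122223333",
  "arn:aws:bedrock-agentcore:us-east-1:111122223333:evaluator/example",
);
```

Without the two conditions, any customer of the service could in principle cause it to invoke the function; the conditions tie the grant to a single account and a single evaluator.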

Description of how you validated changes

Unit Tests (online-evaluation.test.ts + custom-evaluator.test.ts) — 69 evaluation tests covering:

  • Creation with minimal and full props for both constructs
  • Built-in evaluators: all 13 evaluators, custom sampling and filter configurations
  • Custom evaluators: LLM-as-a-Judge (categorical/numerical scales, inference config), code-based (Lambda with timeout, scoped invoke permission)
  • EvaluatorReference.builtin() and EvaluatorReference.custom() produce correct evaluator references
  • Mixed evaluator usage in OnlineEvaluationConfig
  • executionStatus prop (ENABLED, DISABLED, omitted)
  • Input validation, token passthrough, grant and import methods for both constructs

Integration Test (integ.online-evaluation.ts)

  • Deploys OnlineEvaluationConfig with HELPFULNESS and CORRECTNESS built-in evaluators alongside a custom LLM-as-a-Judge evaluator
  • executionStatus: ENABLED
  • Deploy verified via integ-runner --update-on-failed

Rosetta: yarn rosetta:extract --strict passes (README and @example docstring snippets compile)

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

rezabekf and others added 29 commits January 15, 2026 01:06
…tinuous agent evaluation

- Implements OnlineEvaluationConfig L2 construct using AwsCustomResource
- Supports 13 built-in evaluators and custom evaluator references
- Supports CloudWatch Logs and Agent Endpoint data sources
- Auto-creates IAM execution role with required permissions
- Includes sampling, filtering, and session configuration
- Provides grant methods and CloudWatch metrics
- Comprehensive unit and integration tests with 93% coverage
# Conflicts:
#	packages/@aws-cdk/aws-bedrock-agentcore-alpha/README.md
…README, align base class signature

- Remove enableOnCreate from README properties table (prop doesn't exist in code)
- Remove unused _getLogGroupNames() method from DataSourceConfig
- Remove unused validateLogGroupNames() and its constants from validation-helpers
- Accept ResourceProps in OnlineEvaluationBase constructor for consistency with other base classes
…up and service name validation

- Remove dead READ_PERMS constant from EvaluationPerms
- Add validateLogGroupNames (1-5) and validateServiceNames (>=1)
- Wire validation into DataSourceConfig.fromCloudWatchLogs()
- Add unit tests for data source validation
…aky -0 property test

- Add tests for token values skipping validation (configName, description, samplingPercentage, sessionTimeout)
- Add test for empty config name validation
- Fix flaky Property 6 by excluding -0 from number arbitrary (CFN normalizes -0 to 0)
…e type name prefix per awslint:attribute-name
…I visibility in evaluation constructs

Replace internal _render() methods with public bind() on EvaluatorReference
and DataSourceConfig so they are accessible from all JSII target languages.
Add proper return type interfaces (EvaluatorReferenceBindResult,
DataSourceConfigBindResult) since JSII requires named types.
Also fix pre-existing ValidationError constructor signatures and
attrExecutionStatus L1 API change from main merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…log groups in evaluation construct

Replace wildcard resource ('*') on CloudWatch Logs read permissions with
scoped log group ARNs derived from the data source configuration. Uses
Arn.format() with partition/region/account pseudo parameters for
proper cross-partition support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s from evaluation construct

Property-based tests with fast-check are not standard practice in CDK
constructs. The validation logic is already covered by unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…f CDK tagging aspect

Remove the tags property from OnlineEvaluationBaseProps and the manual
Record<string,string> to CfnTag[] conversion. The L1 CfnOnlineEvaluationConfig
implements ITaggableV2 with cdkTagManager, so Tags.of() works automatically.
This follows CDK best practices and avoids potential tag duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uationConfig

The `ExecutionStatus` property on `AWS::BedrockAgentCore::OnlineEvaluationConfig`
is a writable input (ENABLED/DISABLED) but was previously only read back as an
output without being passed to the L1 constructor. This adds it as an optional
input prop with a proper `ExecutionStatus` enum so users can control whether an
evaluation is enabled or disabled at creation time.

Also fixes CWL read permissions that were too aggressively scoped — splits
`DescribeLogGroups` (which doesn't support resource-level restrictions) to `*`
while keeping `StartQuery`/`GetQueryResults` scoped to specific log groups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uation construct

- Replace EvaluationPerms namespace with flat exports (jsii-compatible)
- Remove ADMIN_PERMS and grantAdmin() as they are control plane operations
- Extend IOnlineEvaluationConfigRef from L1 and add onlineEvaluationConfigRef getter
- Remove interface-extends-ref lint exclusion from package.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…online evaluation

Add the `Evaluator` L2 construct wrapping `AWS::BedrockAgentCore::Evaluator`,
supporting LLM-as-a-Judge and code-based (Lambda) evaluation strategies. Custom
evaluators integrate with `OnlineEvaluationConfig` via `EvaluatorReference.custom()`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the feature-request (A feature should be added or improved.), p2, and beginning-contributor ([Pilot] contributed between 0-2 PRs to the CDK) labels Apr 16, 2026
@aws-cdk-automation requested a review from a team April 16, 2026 10:08

github-actions bot commented Apr 16, 2026

⚠️ Experimental Feature: This security report is currently in experimental phase. Results may include false positives and the rules are being actively refined.
This security report is NOT a review blocker. Please try merge from main to avoid findings unrelated to the PR.
To suppress a specific rule, see Suppressing Rules.


Security Guardian Results: 48 tests ran, 48 passed ✅
No test annotations available


github-actions bot commented Apr 16, 2026



Security Guardian Results with resolved templates: 48 tests ran, 48 passed ✅
No test annotations available

@aws-cdk-automation added the pr/needs-further-review label (PR requires additional review from our team specialists due to the scope or complexity of changes.) Apr 16, 2026
@aws-cdk-automation added the pr/needs-community-review label (This PR needs a review from a Trusted Community Member or Core Team Member.) Apr 16, 2026


Development

Successfully merging this pull request may close these issues.

(bedrock-agentcore-alpha): add OnlineEvaluationConfig and Evaluator L2 constructs for online evaluation

2 participants