[cost monitoring] Configure and roll out cost-monitoring across AWS clusters #6515

Open
wants to merge 39 commits into
base: main

Member

@jnywong jnywong commented Aug 6, 2025

Closes #6519

Summary

  • Transfers aws-ce-grafana-backend deployment into a helm dependency on a standalone repo: https://github.com/2i2c-org/jupyterhub-cost-monitoring/
  • Introduces the aws.active_cost_tags key in cluster.yaml so we know when to deploy the resources for cost monitoring, based on whether cost allocation tags are active (activated programmatically in terraform for AWS accounts we manage, but manually on our behalf for AWS accounts where only communities have access to billing)
  • Consolidates cost monitoring terraform config, such as creating IAM role and cost allocation tag resources, into a single file
  • We unconditionally provide the AWS Account ID (12 digit number) and use this to deterministically define jupyterhub-cost-monitoring k8s service account annotation with support/values.jsonnet
  • Updates GH actions due to config/clusters rename for openscapeshub and nasa-ghg-hub
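
As a rough illustration of the deterministic annotation described above, the sketch below builds an IAM role ARN from the account ID and cluster name and wraps it in the eks.amazonaws.com/role-arn annotation that EKS uses for IAM roles for service accounts. The role name pattern (jupyterhub-cost-monitoring-<cluster>) and both function names are assumptions made for illustration; the actual naming lives in the terraform config and support/values.jsonnet.

```python
def cost_monitoring_role_arn(account_id: str, cluster_name: str) -> str:
    """Build an IAM role ARN from the account ID and cluster name.

    The role name pattern is a hypothetical placeholder, not necessarily
    the name used by the real terraform config.
    """
    role_name = f"jupyterhub-cost-monitoring-{cluster_name}"  # assumed pattern
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def service_account_annotations(account_id: str, cluster_name: str) -> dict:
    """Return the annotation that links a k8s service account to the IAM role."""
    return {
        "eks.amazonaws.com/role-arn": cost_monitoring_role_arn(account_id, cluster_name)
    }


if __name__ == "__main__":
    # Placeholder account ID, not a real account.
    print(service_account_annotations("123456789012", "openscapeshub"))
```

Because both inputs are already known per cluster, the annotation can be computed rather than stored, which is the point of passing the account ID in unconditionally.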

Breaking changes

  • renamed openscapes to openscapeshub and nasa-ghg to nasa-ghg-hub for consistency with AWS cluster names
    • they were the only AWS clusters where the name key in the cluster.yaml did not match the aws.clusterName key
    • needed to be renamed because we want to pass the aws.clusterName value into the support/values.jsonnet to configure the standalone cost monitoring application
    • since aws.clusterName is immutable, we rename references in other configs for consistency
      • engineers need to update AWS profile names in ~/.aws/config and ~/.aws/credentials on their local machines
      • two-eye-two-see-org-terraform-state/terraform/state/pilot-hubs tfstate files need to be renamed after merging this PR ⚠️
  • we use the AWS Account ID (12 digit number) to deterministically define jupyterhub-cost-monitoring IAM role
    • aws.account key in cluster.yaml must be populated with the 12 digit account ID
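
The two requirements above (the name key matching aws.clusterName, and aws.account holding a 12-digit ID) could be checked with something like the following sketch. It operates on an already-parsed cluster.yaml dictionary; the function name and messages are hypothetical, and the deployer may validate these differently or not at all.

```python
import re


def check_aws_cluster_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a parsed cluster.yaml dict."""
    problems = []
    if spec.get("provider") != "aws":
        return problems  # nothing to check for non-AWS clusters

    aws = spec.get("aws", {})
    if spec.get("name") != aws.get("clusterName"):
        problems.append("name does not match aws.clusterName")

    # Compare as a string: an unquoted ID may have been parsed as a number.
    account = str(aws.get("account", ""))
    if not re.fullmatch(r"\d{12}", account):
        problems.append("aws.account is not a 12-digit account ID")
    return problems
```

Quoting the account ID in YAML is the safer choice, since an unquoted value may be parsed as an integer and lose any leading zeros.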

github-actions bot commented Aug 6, 2025

Merging this PR will trigger the following deployment actions.

Support deployments

Cloud Provider Cluster Name Reason for Redeploy
gcp leap Support helm chart has been modified
kubeconfig utoronto Support helm chart has been modified
aws disasters Support helm chart has been modified
gcp catalystproject-latam Support helm chart has been modified
gcp 2i2c Support helm chart has been modified
aws strudel Support helm chart has been modified
aws maap Support helm chart has been modified
aws reflective Support helm chart has been modified
gcp cloudbank Support helm chart has been modified
aws jupyter-health Support helm chart has been modified
aws nasa-cryo Support helm chart has been modified
aws nasa-veda Support helm chart has been modified
gcp awi-ciroh Support helm chart has been modified
gcp 2i2c-uk Support helm chart has been modified
aws openscapeshub Support helm chart has been modified
aws catalystproject-africa Support helm chart has been modified
aws nmfs-openscapes Support helm chart has been modified
aws berkeley-geojupyter Support helm chart has been modified
aws neurohackademy Support helm chart has been modified
kubeconfig projectpythia-binder Support helm chart has been modified
kubeconfig 2i2c-jetstream2 Support helm chart has been modified
aws opensci Support helm chart has been modified
aws smithsonian Support helm chart has been modified
gcp climatematch Support helm chart has been modified
aws projectpythia Support helm chart has been modified
aws 2i2c-aws-us Support helm chart has been modified
aws victor Support helm chart has been modified
gcp hhmi Support helm chart has been modified
aws earthscope Support helm chart has been modified
gcp dubois Support helm chart has been modified
aws ubc-eoas Support helm chart has been modified
aws nasa-ghg-hub Support helm chart has been modified

Staging deployments

Cloud Provider Cluster Name Hub Name Reason for Redeploy
gcp leap staging Core infrastructure has been modified
kubeconfig utoronto staging Core infrastructure has been modified
kubeconfig utoronto r-staging Core infrastructure has been modified
aws disasters staging Core infrastructure has been modified
gcp catalystproject-latam staging Core infrastructure has been modified
gcp 2i2c staging Core infrastructure has been modified
gcp 2i2c dask-staging Core infrastructure has been modified
gcp 2i2c ucmerced-staging Core infrastructure has been modified
aws strudel staging Core infrastructure has been modified
aws maap staging Core infrastructure has been modified
aws reflective staging Core infrastructure has been modified
gcp cloudbank staging Core infrastructure has been modified
aws jupyter-health staging Core infrastructure has been modified
aws nasa-cryo staging Core infrastructure has been modified
aws nasa-veda staging Core infrastructure has been modified
gcp awi-ciroh staging Core infrastructure has been modified
gcp 2i2c-uk staging Core infrastructure has been modified
aws openscapeshub staging Core infrastructure has been modified
aws catalystproject-africa staging Core infrastructure has been modified
aws nmfs-openscapes staging Core infrastructure has been modified
aws berkeley-geojupyter staging Core infrastructure has been modified
aws neurohackademy staging Core infrastructure has been modified
kubeconfig 2i2c-jetstream2 staging Core infrastructure has been modified
aws opensci staging Core infrastructure has been modified
aws smithsonian staging Core infrastructure has been modified
gcp climatematch staging Core infrastructure has been modified
aws projectpythia staging Core infrastructure has been modified
aws 2i2c-aws-us staging Core infrastructure has been modified
aws 2i2c-aws-us dask-staging Core infrastructure has been modified
aws victor staging Core infrastructure has been modified
gcp hhmi staging Core infrastructure has been modified
aws earthscope staging Core infrastructure has been modified
aws ubc-eoas staging Core infrastructure has been modified
aws nasa-ghg-hub staging Core infrastructure has been modified

Production deployments

Cloud Provider Cluster Name Hub Name Reason for Redeploy
gcp leap prod Core infrastructure has been modified
gcp leap public Core infrastructure has been modified
kubeconfig utoronto prod Core infrastructure has been modified
kubeconfig utoronto r-prod Core infrastructure has been modified
kubeconfig utoronto highmem Core infrastructure has been modified
aws disasters prod Core infrastructure has been modified
gcp catalystproject-latam unitefa-conicet Core infrastructure has been modified
gcp catalystproject-latam cicada Core infrastructure has been modified
gcp catalystproject-latam gita Core infrastructure has been modified
gcp catalystproject-latam iner Core infrastructure has been modified
gcp catalystproject-latam plnc Core infrastructure has been modified
gcp catalystproject-latam unam Core infrastructure has been modified
gcp catalystproject-latam cabana Core infrastructure has been modified
gcp catalystproject-latam nnb-ccg Core infrastructure has been modified
gcp catalystproject-latam labi Core infrastructure has been modified
gcp catalystproject-latam areciboc3 Core infrastructure has been modified
gcp catalystproject-latam valledellili Core infrastructure has been modified
gcp 2i2c imagebuilding-demo Core infrastructure has been modified
gcp 2i2c binderhub-ui-demo Core infrastructure has been modified
gcp 2i2c demo Core infrastructure has been modified
gcp 2i2c temple Core infrastructure has been modified
gcp 2i2c ucmerced Core infrastructure has been modified
gcp 2i2c mtu Core infrastructure has been modified
aws strudel prod Core infrastructure has been modified
aws maap prod Core infrastructure has been modified
aws reflective prod Core infrastructure has been modified
aws reflective workshop Core infrastructure has been modified
gcp cloudbank authoring Core infrastructure has been modified
gcp cloudbank bcc Core infrastructure has been modified
gcp cloudbank chaffey Core infrastructure has been modified
gcp cloudbank ccsf Core infrastructure has been modified
gcp cloudbank chabot Core infrastructure has been modified
gcp cloudbank csm Core infrastructure has been modified
gcp cloudbank csum Core infrastructure has been modified
gcp cloudbank demo Core infrastructure has been modified
gcp cloudbank dvc Core infrastructure has been modified
gcp cloudbank elac Core infrastructure has been modified
gcp cloudbank elcamino Core infrastructure has been modified
gcp cloudbank evc Core infrastructure has been modified
gcp cloudbank fresno Core infrastructure has been modified
gcp cloudbank foothill Core infrastructure has been modified
gcp cloudbank glendale Core infrastructure has been modified
gcp cloudbank golden Core infrastructure has been modified
gcp cloudbank high Core infrastructure has been modified
gcp cloudbank humboldt Core infrastructure has been modified
gcp cloudbank lacc Core infrastructure has been modified
gcp cloudbank lahc Core infrastructure has been modified
gcp cloudbank laney Core infrastructure has been modified
gcp cloudbank lavc Core infrastructure has been modified
gcp cloudbank lbcc Core infrastructure has been modified
gcp cloudbank mendocino Core infrastructure has been modified
gcp cloudbank merced Core infrastructure has been modified
gcp cloudbank merritt Core infrastructure has been modified
gcp cloudbank miracosta Core infrastructure has been modified
gcp cloudbank mission Core infrastructure has been modified
gcp cloudbank moreno Core infrastructure has been modified
gcp cloudbank norco Core infrastructure has been modified
gcp cloudbank palomar Core infrastructure has been modified
gcp cloudbank pasadena Core infrastructure has been modified
gcp cloudbank redwoods Core infrastructure has been modified
gcp cloudbank reedley Core infrastructure has been modified
gcp cloudbank riohondo Core infrastructure has been modified
gcp cloudbank saddleback Core infrastructure has been modified
gcp cloudbank sbcc Core infrastructure has been modified
gcp cloudbank sbcc-dev Core infrastructure has been modified
gcp cloudbank sierra Core infrastructure has been modified
gcp cloudbank sjcc Core infrastructure has been modified
gcp cloudbank sjsu Core infrastructure has been modified
gcp cloudbank skyline Core infrastructure has been modified
gcp cloudbank srjc Core infrastructure has been modified
gcp cloudbank tuskegee Core infrastructure has been modified
gcp cloudbank ucsc Core infrastructure has been modified
gcp cloudbank wlac Core infrastructure has been modified
aws jupyter-health prod Core infrastructure has been modified
aws nasa-cryo prod Core infrastructure has been modified
aws nasa-veda prod Core infrastructure has been modified
aws nasa-veda binder Core infrastructure has been modified
gcp awi-ciroh prod Core infrastructure has been modified
gcp awi-ciroh workshop Core infrastructure has been modified
gcp 2i2c-uk lis Core infrastructure has been modified
aws openscapeshub prod Core infrastructure has been modified
aws openscapeshub workshop Core infrastructure has been modified
aws catalystproject-africa nm-aist Core infrastructure has been modified
aws catalystproject-africa must Core infrastructure has been modified
aws catalystproject-africa uvri Core infrastructure has been modified
aws catalystproject-africa wits Core infrastructure has been modified
aws catalystproject-africa kush Core infrastructure has been modified
aws catalystproject-africa molerhealth Core infrastructure has been modified
aws catalystproject-africa aibst Core infrastructure has been modified
aws catalystproject-africa bhki Core infrastructure has been modified
aws catalystproject-africa bon Core infrastructure has been modified
aws nmfs-openscapes prod Core infrastructure has been modified
aws nmfs-openscapes workshop Core infrastructure has been modified
aws nmfs-openscapes noaa-only Core infrastructure has been modified
aws berkeley-geojupyter prod Core infrastructure has been modified
aws neurohackademy prod Core infrastructure has been modified
kubeconfig projectpythia-binder binderhub Core infrastructure has been modified
aws opensci sciencecore Core infrastructure has been modified
aws opensci climaterisk Core infrastructure has been modified
aws opensci small-binder Core infrastructure has been modified
aws opensci big-binder Core infrastructure has been modified
aws smithsonian prod Core infrastructure has been modified
gcp climatematch prod Core infrastructure has been modified
aws projectpythia prod Core infrastructure has been modified
aws projectpythia pythia-binder Core infrastructure has been modified
aws 2i2c-aws-us showcase Core infrastructure has been modified
aws victor prod Core infrastructure has been modified
gcp hhmi spyglass Core infrastructure has been modified
gcp hhmi binder Core infrastructure has been modified
aws earthscope prod Core infrastructure has been modified
aws earthscope binder Core infrastructure has been modified
gcp dubois ephemeral Core infrastructure has been modified
aws ubc-eoas prod Core infrastructure has been modified
aws nasa-ghg-hub prod Core infrastructure has been modified
aws nasa-ghg-hub binder Core infrastructure has been modified

@jnywong jnywong force-pushed the standalone-cost-monitoring branch from 9732fad to 419a510 Compare August 7, 2025 15:55
@@ -1,4 +1,6 @@
local cluster_name = std.extVar('2I2C_VARS.CLUSTER_NAME');
local provider_name = std.extVar('2I2C_VARS.PROVIDER');
local aws_account_id = std.extVar('2I2C_VARS.AWS_ACCOUNT_ID');
Member

Add a comment on what this variable will resolve to if we aren't on AWS

Member Author

Added comment: "# undefined if provider_name != 'aws'"

Member Author

I think this should be fine, since the variable is currently only needed if provider_name == 'aws' in

'jupyterhub-cost-monitoring': if provider_name == 'aws' then configCostMonitoring else { enabled: false },

Member Author

@jnywong jnywong Aug 8, 2025

Actually! I have changed the approach to include aws_account_id as a top-level argument to handle the case where this is undefined for non-AWS cloud providers.

This is passed to the jsonnet via the deployer deploy-support command based on the new key aws.active_cost_tags under the cluster.yaml file.

@yuvipanda
Member

Two comments, otherwise this looks good!

Do test with staging on both openscapes and ghg where we are doing a rename to make sure that they work correctly.

@jnywong jnywong force-pushed the standalone-cost-monitoring branch 2 times, most recently from 2a45e30 to 9b7885d Compare August 8, 2025 15:39
@jnywong jnywong marked this pull request as ready for review August 8, 2025 16:02
@jnywong jnywong requested a review from yuvipanda August 8, 2025 16:02
@jnywong
Member Author

jnywong commented Aug 8, 2025

@yuvipanda I have tested this on openscapeshub and nasa-ghg-hub and all looks good. If you decide to approve this, then I can slowly (and calmly :D) apply the terraform changes across all our AWS clusters over the weekend, when user activity is low, in case something gets broken.

Member

@yuvipanda yuvipanda left a comment

This mostly looks great! I want to expound a principle for our cluster.yaml and then ask for a specific change here.

The principle is that 'we should keep things in cluster.yaml as minimal as possible'. This means that every time we add something in there, we should check to see 'can this be done by another means?'. The primary goal here is to rely on things that are upstream (helm, z2jh) as much as possible. My intuition here makes me want to be particularly careful with things that are bools. billing.paid_by_us is not really used anywhere for example.

In this case, it looks like the only point of active_cost_tags is to enable or disable cost monitoring. If we were to apply the previous principle here, we can accomplish that goal with helm directly. Any support.values.yaml in a particular cluster takes precedence over support/values.jsonnet. So I suggest we:

  1. Remove billing.active_cost_tags as a field
  2. Document that for cases where we don't have cost tags enabled, we should set jupyterhub-cost-monitoring.enabled to false in support.values.yaml
  3. Set this wherever needed.

This also means we don't have to have a field in cluster.yaml that's true most of the time!

if provider is not None:
    command += ["--ext-str", f"2I2C_VARS.PROVIDER={provider}"]
if aws_account_id is not None:
    command += ["--tla-str", f"aws_account_id={aws_account_id}"]
Member

Is there a reason to use tla-str rather than ext-str here? If so, document it and if not let's use the same one everywhere.

Let's also use 2I2C_VARS.<ALL_CAPS> as the format for variables we pass in? If that's not something we can do for tla let's find an equivalent one that very clearly marks it as a 2i2c specific variable we are passing in

Member Author

Yes, a tla-str argument can still be referenced in support/values.jsonnet when it is undefined, which is the case for support charts on non-AWS clusters. I've wrapped everything except the global --ext-str variables in a function so that the tla-str can be passed in.

tla-str does not accept the format proposed, so I have tweaked our 2i2c variable names to: VAR_2I2C_<variable_name>
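
To make the distinction concrete: an external variable read with std.extVar must be supplied or jsonnet fails to evaluate, whereas a top-level argument is a parameter of a top-level function and can be given a default. The sketch below shows how a command list might be assembled under that scheme. The function name, the default file path, and the exact variable name VAR_2I2C_AWS_ACCOUNT_ID are assumptions that follow the VAR_2I2C_<variable_name> pattern mentioned above; this is not the deployer's actual code.

```python
from typing import Optional


def build_jsonnet_command(
    cluster_name: str,
    provider: str,
    aws_account_id: Optional[str] = None,
    values_file: str = "support/values.jsonnet",
) -> list[str]:
    """Assemble a jsonnet invocation (hypothetical helper)."""
    command = [
        "jsonnet",
        # External variables: read in jsonnet with std.extVar(...)
        "--ext-str", f"2I2C_VARS.CLUSTER_NAME={cluster_name}",
        "--ext-str", f"2I2C_VARS.PROVIDER={provider}",
    ]
    if aws_account_id is not None:
        # Top-level argument: omitted entirely for non-AWS clusters, so the
        # jsonnet function needs a default value for this parameter.
        command += ["--tla-str", f"VAR_2I2C_AWS_ACCOUNT_ID={aws_account_id}"]
    command.append(values_file)
    return command
```

For a GCP cluster the returned list contains no --tla-str entry at all, which is the case the top-level function's default has to cover.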

}
if (
    self.spec["provider"] == "aws"
    and self.spec["aws"]["billing"]["active_cost_tags"]
Member

We can pass this account ID in unconditionally on all AWS things regardless of whether we enable cost monitoring or not. Let's generally try to keep the knowledge about what's in the cluster files and conditional logic as minimal as possible.

},
},
},
}
'jupyterhub-cost-monitoring': if std.type(aws_account_id) != 'null' then configCostMonitoring(aws_account_id) else { enabled: false },
Member

Let's unconditionally enable this on AWS here by checking on provider.

Member Author

Okay! I've checked that the jupyterhub-cost-monitoring.enabled: false override in the support/values.yaml works as expected and clobbers the jsonnet config.
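
The override behaviour confirmed here can be pictured as a recursive merge in which values from the later source replace those from the earlier one, which is how Helm combines multiple values files. The sketch below is not Helm's or the deployer's implementation, and the dictionaries are made-up stand-ins for the rendered jsonnet output and a cluster's support.values.yaml.

```python
def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge two values dicts; keys in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for what support/values.jsonnet renders on an AWS cluster.
from_jsonnet = {
    "jupyterhub-cost-monitoring": {
        "enabled": True,
        "extraEnv": {"CLUSTER_NAME": "example-cluster"},  # made-up key
    }
}
# Stand-in for a cluster's support.values.yaml that opts out.
from_cluster_yaml = {"jupyterhub-cost-monitoring": {"enabled": False}}

final = merge_values(from_jsonnet, from_cluster_yaml)
assert final["jupyterhub-cost-monitoring"]["enabled"] is False
```

Only the enabled key is replaced; sibling keys from the jsonnet output survive the merge, so a cluster can opt out with a one-line override.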

@jnywong jnywong force-pushed the standalone-cost-monitoring branch from a970265 to b5406b4 Compare August 10, 2025 14:36
@jnywong jnywong changed the title from "[cost monitoring] Encapsulate repeatable configs as jsonnet" to "[cost monitoring] Configure and roll out cost-monitoring across AWS clusters" Aug 10, 2025
@jnywong
Member Author

jnywong commented Aug 11, 2025

Okay, requested changes made!

Thanks for the review – I got too laser-focused on "how much can i generalise this with jsonnet?" and ended up with too much conditional logic like you said. Absolutely config/clusters/<cluster_name>/support.values.yaml should be the last line for explicitly disabling the system, nice!

  1. Document that for cases where we don't have cost tags enabled, we should set jupyterhub-cost-monitoring.enabled to false in support.values.yaml

Docs are updated in #6488 :)

Development

Successfully merging this pull request may close these issues.

Cost monitoring backend can be installed as a standalone helm-chart repository (phase 2)
2 participants