Adds versions to timeseries schema #3996
Conversation
This resolves the first part of #1293, adding versions into just the traits used to describe metrics themselves. The next chunk is adding them to the database. That involves a schema change, and is somewhat blocked on (1) tooling to make such changes in ClickHouse (like #3760), and (2) the move to replicated tables in #3494.
Force-pushed from 825e661 to 4885337
- Adds a version to the `Target` and `Metric` traits in `oximeter`. This is used to track the version of the schema defined by those types. It defaults to 1.
- Adds better support for tweaking the derived impls, allowing configuration of the name, description, version, and datum. This is partially to provide scaffolding for things like units in the future.
- Adds utilities for testing that a schema change is accompanied by a version change. This is basically expectorate, tailored for this use case.
- Adds a schema change test to the HTTP service metrics used in Nexus.
- Makes the `TimeseriesName` a proper part of `oximeter` core, and includes the version numbers for the target / metric. These can be provided by a structured string like `target_name[@target_version]:metric_name[@metric_version]`, where the versions default to 1 to match the derived trait. The versions are _not_ yet stored in the DB, so we transfer an unversioned name to the DB for now.
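The versioned-name format described in the last bullet can be illustrated with a small parser. This is a hypothetical sketch, not `oximeter`'s actual API (the type and function names here are made up); it just shows how the `@version` components default to 1 when omitted:

```rust
// Hypothetical sketch of parsing the structured name format
// `target_name[@target_version]:metric_name[@metric_version]`.
// The real `oximeter` types and parsing logic may differ.

/// A timeseries name with per-component versions, defaulting to 1.
#[derive(Debug, PartialEq)]
struct VersionedName {
    target_name: String,
    target_version: u64,
    metric_name: String,
    metric_version: u64,
}

/// Parse one component, e.g. `"http_service@2"` or just `"http_service"`.
fn parse_component(s: &str) -> Option<(String, u64)> {
    match s.split_once('@') {
        // No explicit version: default to 1, matching the derived trait.
        None => Some((s.to_string(), 1)),
        Some((name, ver)) => Some((name.to_string(), ver.parse().ok()?)),
    }
}

fn parse_versioned_name(s: &str) -> Option<VersionedName> {
    let (target, metric) = s.split_once(':')?;
    let (target_name, target_version) = parse_component(target)?;
    let (metric_name, metric_version) = parse_component(metric)?;
    Some(VersionedName { target_name, target_version, metric_name, metric_version })
}

fn main() {
    let parsed = parse_versioned_name("http_service@2:request_latency").unwrap();
    assert_eq!(parsed.target_version, 2);
    assert_eq!(parsed.metric_version, 1); // defaulted
    println!("{parsed:?}");
}
```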
Force-pushed from 4885337 to a98c5e4
```rust
    check_contents(path, actual)
}

fn check_contents(
```
Any reason to not use the `expectorate` crate itself? Why mimic the behavior?
Yeah. The issue is that I don't want to just assert the contents are the same. I want to assert that, if they're different, the version number must have changed. I thought about catching the panic in `expectorate`, but it just seemed complicated. I'll take another look though, I may be splitting hairs.
```rust
if expected_version.as_i64().unwrap() >= actual_version.as_i64().unwrap()
    && !is_overwrite
{
```
If "expected >= actual && !overwrite", we throw an error. That makes sense to me.
But if "expected < actual", why don't we also throw an error? Why is that case allowed?
The idea was that you've properly increased the version along with a change to the schema. This is part of why I didn't just use `expectorate` here: I want to check that the contents are the same, or if they're not, the version number is larger. Part of my worry was that it seems likely that folks just kinda put `EXPECTORATE=overwrite` whenever they see a mismatch in this kind of situation. I was hoping it'd be better for folks to see an error that they need to bump the version, and then do that, at which point the update will happen automatically.
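The check described here can be sketched roughly as follows. This is a minimal illustration, assuming made-up names (`CheckResult`, the string/version parameters); the actual helper in this PR works on JSON files and panics rather than returning a result:

```rust
// Minimal sketch of the "contents may only change with a version bump"
// check. Names and signature are illustrative, not the PR's actual code.

#[derive(Debug, PartialEq)]
enum CheckResult {
    /// Stored schema matches the in-memory schema.
    Unchanged,
    /// Schema changed, and the change is accepted.
    Updated,
    /// Schema changed without a version bump: fail loudly so the
    /// developer bumps the version instead of silently overwriting.
    NeedsVersionBump,
}

fn check_contents(
    expected: &str,
    expected_version: i64,
    actual: &str,
    actual_version: i64,
    is_overwrite: bool,
) -> CheckResult {
    if expected == actual {
        CheckResult::Unchanged
    } else if actual_version > expected_version || is_overwrite {
        CheckResult::Updated
    } else {
        CheckResult::NeedsVersionBump
    }
}

fn main() {
    assert_eq!(check_contents("{}", 1, "{}", 1, false), CheckResult::Unchanged);
    assert_eq!(check_contents("{}", 1, "{\"a\":1}", 2, false), CheckResult::Updated);
    assert_eq!(
        check_contents("{}", 1, "{\"a\":1}", 1, false),
        CheckResult::NeedsVersionBump
    );
}
```

Note that the `is_overwrite` escape hatch is exactly what the discussion below worries about: it accepts any change regardless of version.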
But it seems like setting `EXPECTORATE=overwrite` obliterates the stored thing anyway -- so ironically, if someone skims the error, sees a diff, and sets that environment variable, they'll toss out the old stored JSON version and just use whatever value came from the in-memory struct regardless.

This could arguably even happen accidentally, with someone changing an unrelated openapi definition, but running `EXPECTORATE=overwrite cargo test` and trying to bump everything.
I know we're trying to go for a better user experience here, but I think the guidance on "what you should change" is more subtle than "compare with what happened to exist in this file before". I think this is complicated by the file not always representing "what is stored on upstream" -- if I update `http-service.json` to use `version=2`, and then update it again before merging, showing the diff with this file isn't quite right. I actually care about the diff from this file on `main`.
FWIW, I am also interested in @ahl's thoughts here, since our current policy everywhere else is:
- Store the schema explicitly
- Override the schema entirely
- Rely on humans to check the diff during code review, and confirm that these migrations are okay
I agree with all that, and I definitely waffled when implementing it. So the goal is to make sure that folks don't change the schema willy-nilly. If the schema does change, we really want to make sure that (1) folks are aware of that and (2) they've definitely bumped the version.
If we're relying on humans checking the diff, then straight-up `expectorate` seems like the easiest / simplest approach. Adam's thoughts would definitely be helpful here.
> If I update `http-service.json` to use `version=2`, and then update it again before merging, showing the diff with this file isn't quite right, I actually care about the diff from this file on `main`.
So for the multiple updates issue, we could explicitly compare with the checked in contents. I'm not sure if that's better, or just more magical though.
Yeah, I was on the fence about whether or not to suggest that, but it does seem like it encapsulates what we care about:
- Does your "actual struct" match your "schema file"? (e.g., do you acknowledge you are changing the schema, in some way?)
- Does your schema change (diff with main) update the version, if any diff exists? (e.g., if you're changing the schema, are you doing it the right way?)
I talked with Adam about some of this. He threw out a possibly crazy idea: suppose we don't have human-friendly versions for these, but we use a digest of the timeseries schema itself?
Some drawbacks
- Lose human-friendly version numbers. It's kind of annoying to look at `target@111683034273293` instead of `target@2`.
- The ordering of versions has to be determined by the time they're created (stored in the DB), not by the version numbers themselves.
- There is no default. Since we're adding versions now, we can't really generate any meaningful default for those schema which lack versions. This is the big one, IMO. I think we'd need to just blow away data prior to this, or make 0 special and hope no schema actually hashes to that.
- We need to pick a stable hash, something like a SHA sum, since the hash implementation in the Rust standard library could change.
Some benefits
- It's not really possible to change the schema and not change the version. We may still rely on a file for developers to ensure that their schema only changes when they want it to, but it's not possible to hit schema mismatches in the database when trying to insert new data.
- Developers don't have to worry about the version for the most part. It changes when it needs to.
This seems like it might make querying harder, but I don't believe it actually does. We can still support querying a specific version of the timeseries, a client just needs to use a bigger number to do that. And supporting queries against the timeseries agnostic to the version is no different. We probably still want to go look at all the versions for the timeseries to resolve those that are consistent with the query (if any).
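The content-derived version idea can be sketched concretely. For illustration this uses an inline FNV-1a 64-bit hash, purely because it is stable and dependency-free; as the surrounding discussion notes, a real implementation would likely choose SHA2 or BLAKE3 so the same digest could be computed in ClickHouse. The function names and the canonicalization scheme here are assumptions, not the PR's design:

```rust
// Sketch: derive the schema "version" from the schema contents.
// FNV-1a 64-bit, implemented inline so the example has no dependencies;
// the hash choice is illustrative only (SHA2/BLAKE3 in real use).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hash the parts of the schema that matter: the timeseries name and
/// its (field name, field type) pairs, in a fixed canonical order.
fn schema_version(name: &str, fields: &[(&str, &str)]) -> u64 {
    let mut canonical = String::from(name);
    for (field, ty) in fields {
        canonical.push_str(&format!("\n{field}:{ty}"));
    }
    fnv1a_64(canonical.as_bytes())
}

fn main() {
    let v1 = schema_version("http_service:request_latency", &[("route", "String")]);
    let v2 = schema_version(
        "http_service:request_latency",
        &[("route", "String"), ("method", "String")],
    );
    // Any change to the schema produces a different version automatically,
    // which is exactly the benefit described above.
    assert_ne!(v1, v2);
}
```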
I've been experimenting with this for a few days, and I think it's time to actually just drop it. I wanted to include some thoughts.

First, I like the idea of content-based versions. Hashing the relevant parts of the schema has a lot of nice properties: it can be made stable; it's automatic, so the developer doesn't need to manage the version manually; and it doesn't make querying any harder than if we support manual versioning. XXHash is a good choice for non-cryptographic hashes. SHA2 is a good choice for cryptographic hashes, although BLAKE3 is pretty interesting. One thing to consider here is whether we need to compute these hashes in ClickHouse. Their implementation of XXHash64 is unfortunately a bit non-standard, and it's effectively impossible to guarantee we can compute the same values in Rust and ClickHouse. That would point towards a more widely-used standard like SHA2. BLAKE3 support in ClickHouse needs to be enabled at build time, and our build of ClickHouse does not do so. That's an easy fix if we want to go that route.

The second main issue I wanted to raise is around schema updates in the database itself. We probably want to add columns to the […]. This particular issue is made much more complicated by the move to clustered ClickHouse. Consider CockroachDB to start: one simply starts the database with a few different command-line arguments, but thereafter all interactions and all SQL statements work the same as in the single-node case. This is in stark contrast to ClickHouse. To build replicated tables, one needs to provide pretty substantially different configuration files; launch additional processes (ClickHouse Keeper); and, most importantly, the SQL itself is different. We started to handle this by asking the DB if a specific cluster exists, but I wonder if a more robust abstraction over single-node vs. clustered deployments would be useful, especially for testing purposes.

One could imagine a thin layer that takes the "single-node" SQL, plus a set of optional configuration parameters specifying the cluster details, and slightly rewrites the SQL to present either a single or distributed table that "looks" the same. This is part of the point behind the existing replicated table structure. In the single-node case, we create a […]. This all requires some tooling to easily manage the deployment of ClickHouse, though. We could imagine improving the existing […]. Anyway, this is a lot of random thoughts, but I'm unfortunately not in a position to spend the time disentangling them all to make a logical work stream. It's also not obvious to me that we need to do this now. We're limiting the amount of data in ClickHouse to 1 month, and until we have a bunch more metrics, it just doesn't take much space. The longer we wait, the harder this might get, but I think we need to make sure we have a solid plan for making the single- to multi-node transition work well, in a way we can reliably test.
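The "thin layer" idea above can be sketched very roughly. Everything here is hypothetical: the config struct, function names, and the generated DDL are illustrative only, and real ClickHouse replicated deployments need much more (Keeper/ZooKeeper paths, `Distributed` tables over the `_local` ones, and so on):

```rust
// Rough sketch of a layer that renders either single-node or replicated
// DDL from one logical table definition. Names and SQL are illustrative.

struct ClusterConfig {
    /// Name of the ClickHouse cluster, as defined in the server config.
    cluster_name: String,
}

/// Render CREATE TABLE DDL; the replicated form uses ON CLUSTER and a
/// ReplicatedMergeTree engine, while the single-node form uses MergeTree.
fn render_ddl(table: &str, columns: &str, cluster: Option<&ClusterConfig>) -> String {
    match cluster {
        None => format!(
            "CREATE TABLE {table} ({columns}) ENGINE = MergeTree() ORDER BY tuple()"
        ),
        Some(cfg) => format!(
            "CREATE TABLE {table}_local ON CLUSTER {cluster} ({columns}) \
             ENGINE = ReplicatedMergeTree() ORDER BY tuple()",
            cluster = cfg.cluster_name,
        ),
    }
}

fn main() {
    let single = render_ddl("measurements", "ts DateTime, value Float64", None);
    assert!(single.contains("ENGINE = MergeTree"));

    let cfg = ClusterConfig { cluster_name: "oximeter_cluster".into() };
    let replicated = render_ddl("measurements", "ts DateTime, value Float64", Some(&cfg));
    assert!(replicated.contains("ReplicatedMergeTree"));
    println!("{single}\n{replicated}");
}
```

The point of such a layer is exactly what's described above: callers write against one logical schema, and the single-node vs. clustered distinction is confined to one place that tests can exercise both ways.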