Adds versions to timeseries schema #3996
Conversation
This resolves the first part of #1293, adding versions into just the traits used to describe metrics themselves. The next chunk is adding them to the database. That involves a schema change, and is somewhat blocked on (1) tooling to make such changes in ClickHouse (like #3760), and (2) the move to replicated tables in #3494.
Force-pushed from 825e661 to 4885337
- Adds a version to the `Target` and `Metric` traits in `oximeter`. This is used to track the version of the schema defined by those types. It defaults to 1.
- Adds better support for tweaking the derived impls, allowing configuration of the name, description, version, and datum. This is partially to provide scaffolding for things like units in the future.
- Adds utilities for testing that a schema change is accompanied by a version change. This is basically expectorate, tailored for this use case.
- Adds a schema change test to the HTTP service metrics used in Nexus.
- Makes the `TimeseriesName` a proper part of `oximeter` core, and includes the version numbers for the target / metric. These can be provided by a structured string like `target_name[@target_version]:metric_name[@metric_version]`, where the versions default to 1 to match the derived trait. The versions are _not_ yet stored in the DB, so we transfer an unversioned name to the DB for now.
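The versioned-name format described in the last bullet can be illustrated with a small parser. This is a hypothetical sketch, not `oximeter`'s actual API (the type and function names here are made up); it just shows how the `@version` components default to 1 when omitted:

```rust
// Hypothetical sketch of parsing the structured name format
// `target_name[@target_version]:metric_name[@metric_version]`.
// The real `oximeter` types and parsing logic may differ.

/// A timeseries name with per-component versions, defaulting to 1.
#[derive(Debug, PartialEq)]
struct VersionedName {
    target_name: String,
    target_version: u64,
    metric_name: String,
    metric_version: u64,
}

/// Parse one component, e.g. `"http_service@2"` or just `"http_service"`.
fn parse_component(s: &str) -> Option<(String, u64)> {
    match s.split_once('@') {
        // No explicit version: default to 1, matching the derived trait.
        None => Some((s.to_string(), 1)),
        Some((name, ver)) => Some((name.to_string(), ver.parse().ok()?)),
    }
}

fn parse_versioned_name(s: &str) -> Option<VersionedName> {
    let (target, metric) = s.split_once(':')?;
    let (target_name, target_version) = parse_component(target)?;
    let (metric_name, metric_version) = parse_component(metric)?;
    Some(VersionedName { target_name, target_version, metric_name, metric_version })
}

fn main() {
    let parsed = parse_versioned_name("http_service@2:request_latency").unwrap();
    assert_eq!(parsed.target_version, 2);
    assert_eq!(parsed.metric_version, 1); // defaulted
    println!("{parsed:?}");
}
```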
Force-pushed from 4885337 to a98c5e4
```rust
    check_contents(path, actual)
}

fn check_contents(
```
Any reason to not use the `expectorate` crate itself? Why mimic the behavior?
Yeah. The issue is that I don't want to just assert the contents are the same. I want to assert that, if they're different, the version number must have changed. I thought about catching the panic in `expectorate`, but it just seemed complicated. I'll take another look though, I may be splitting hairs.
```rust
if expected_version.as_i64().unwrap() >= actual_version.as_i64().unwrap()
    && !is_overwrite
{
```
If "expected >= actual && !overwrite", we throw an error. That makes sense to me.
But if "expected < actual", why don't we also throw an error? Why is that case allowed?
The idea was that you've properly increased the version along with a change to the schema. This is part of why I didn't just use `expectorate` here: I want to check that the contents are the same, or if they're not, the version number is larger. Part of my worry was that it seems likely that folks just kinda put `EXPECTORATE=overwrite` whenever they see a mismatch in this kind of situation. I was hoping it'd be better for folks to see an error that they need to bump the version, and then do that, at which point the update will happen automatically.
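The check described here can be sketched roughly as follows. This is a minimal illustration, assuming made-up names (`CheckResult`, the string/version parameters); the actual helper in this PR works on JSON files and panics rather than returning a result:

```rust
// Minimal sketch of the "contents may only change with a version bump"
// check. Names and signature are illustrative, not the PR's actual code.

#[derive(Debug, PartialEq)]
enum CheckResult {
    /// Stored schema matches the in-memory schema.
    Unchanged,
    /// Schema changed, and the change is accepted.
    Updated,
    /// Schema changed without a version bump: fail loudly so the
    /// developer bumps the version instead of silently overwriting.
    NeedsVersionBump,
}

fn check_contents(
    expected: &str,
    expected_version: i64,
    actual: &str,
    actual_version: i64,
    is_overwrite: bool,
) -> CheckResult {
    if expected == actual {
        CheckResult::Unchanged
    } else if actual_version > expected_version || is_overwrite {
        CheckResult::Updated
    } else {
        CheckResult::NeedsVersionBump
    }
}

fn main() {
    assert_eq!(check_contents("{}", 1, "{}", 1, false), CheckResult::Unchanged);
    assert_eq!(check_contents("{}", 1, "{\"a\":1}", 2, false), CheckResult::Updated);
    assert_eq!(
        check_contents("{}", 1, "{\"a\":1}", 1, false),
        CheckResult::NeedsVersionBump
    );
}
```

Note that the `is_overwrite` escape hatch is exactly what the discussion below worries about: it accepts any change regardless of version.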
But it seems like setting `EXPECTORATE=overwrite` obliterates the stored thing anyway -- so ironically, if someone skims the error, sees a diff, and sets that environment variable, they'll toss out the old stored JSON version and just use whatever value came from the in-memory struct regardless.

This could arguably even happen accidentally, with someone changing an unrelated openapi definition, but running `EXPECTORATE=overwrite cargo test` and trying to bump everything.
I know we're trying to go for a better user experience here, but I think the guidance on "what you should change" is more subtle than "compare with what happened to exist in this file before". I think this is complicated by the file not always representing "what is stored on upstream" -- if I update `http-service.json` to use `version=2`, and then update it again before merging, showing the diff with this file isn't quite right. I actually care about the diff from this file on `main`.
FWIW, I am also interested in @ahl's thoughts here, since our current policy everywhere else is:
- Store the schema explicitly
- Override the schema entirely
- Rely on humans to check the diff during code review, and confirm that these migrations are okay
I agree with all that, and I definitely waffled when implementing it. So the goal is to make sure that folks don't change the schema willy-nilly. If the schema does change, we really want to make sure that (1) folks are aware of that and (2) they've definitely bumped the version.
If we're relying on humans checking the diff, then straight-up `expectorate` seems like the easiest / simplest approach. Adam's thoughts would definitely be helpful here.
> If I update `http-service.json` to use `version=2`, and then update it again before merging, showing the diff with this file isn't quite right, I actually care about the diff from this file on `main`.
So for the multiple updates issue, we could explicitly compare with the checked in contents. I'm not sure if that's better, or just more magical though.
Yeah, I was on the fence about whether or not to suggest that, but it does seem like it encapsulates what we care about:
- Does your "actual struct" match your "schema file"? (e.g., do you acknowledge you are changing the schema, in some way?)
- Does your schema change (diff with main) update the version, if any diff exists? (e.g., if you're changing the schema, are you doing it the right way?)
I talked with Adam about some of this. He threw out a possibly crazy idea: suppose we don't have human-friendly versions for these, but we use a digest of the timeseries schema itself?
Some drawbacks
- Lose human-friendly version numbers. It's kind of annoying to look at `target@111683034273293` instead of `target@2`.
- The ordering of versions has to be determined by the time they're created (stored in the DB), not by the version numbers themselves.
- There is no default. Since we're adding versions now, we can't really generate any meaningful default for those schema which lack versions. This is the big one, IMO. I think we'd need to just blow away data prior to this, or make 0 special and hope no schema actually hashes to that.
- We need to pick a stable hash, something like a SHA sum, since the hash implementation in the Rust standard library could change.
Some benefits
- It's not really possible to change the schema and not change the version. We may still rely on a file for developers to ensure that their schema only changes when they want it to, but it's not possible to hit schema mismatches in the database when trying to insert new data.
- Developers don't have to worry about the version for the most part. It changes when it needs to.
This seems like it might make querying harder, but I don't believe it actually does. We can still support querying a specific version of the timeseries, a client just needs to use a bigger number to do that. And supporting queries against the timeseries agnostic to the version is no different. We probably still want to go look at all the versions for the timeseries to resolve those that are consistent with the query (if any).
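The content-derived version idea can be sketched concretely. For illustration this uses an inline FNV-1a 64-bit hash, purely because it is stable and dependency-free; as the surrounding discussion notes, a real implementation would likely choose SHA2 or BLAKE3 so the same digest could be computed in ClickHouse. The function names and the canonicalization scheme here are assumptions, not the PR's design:

```rust
// Sketch: derive the schema "version" from the schema contents.
// FNV-1a 64-bit, implemented inline so the example has no dependencies;
// the hash choice is illustrative only (SHA2/BLAKE3 in real use).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hash the parts of the schema that matter: the timeseries name and
/// its (field name, field type) pairs, in a fixed canonical order.
fn schema_version(name: &str, fields: &[(&str, &str)]) -> u64 {
    let mut canonical = String::from(name);
    for (field, ty) in fields {
        canonical.push_str(&format!("\n{field}:{ty}"));
    }
    fnv1a_64(canonical.as_bytes())
}

fn main() {
    let v1 = schema_version("http_service:request_latency", &[("route", "String")]);
    let v2 = schema_version(
        "http_service:request_latency",
        &[("route", "String"), ("method", "String")],
    );
    // Any change to the schema produces a different version automatically,
    // which is exactly the benefit described above.
    assert_ne!(v1, v2);
}
```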
I've been experimenting with this for a few days, and I think it's time to actually just drop it. I wanted to include some thoughts.

First, I like the idea of content-based versions. Hashing the relevant parts of the schema has a lot of nice properties: it can be made stable; it's automatic, so the developer doesn't need to manage the version manually; and it doesn't make querying any harder than if we support manual versioning. XXHash is a good choice for non-cryptographic hashes. SHA2 is a good choice for cryptographic hashes, although BLAKE3 is pretty interesting. One thing to consider here is whether we need to compute these hashes in ClickHouse. Their implementation of XXHash64 is unfortunately a bit non-standard, and it's effectively impossible to guarantee we can compute the same values in Rust and ClickHouse. That would point towards a more widely-used standard like SHA2. BLAKE3 support in ClickHouse needs to be enabled at build time, and our build of ClickHouse does not do so. That's an easy fix if we want to go that route.

The second main issue I wanted to raise is around schema updates in the database itself. We probably want to add columns to the […]. This particular issue is made much more complicated by the move to clustered ClickHouse. Consider CockroachDB to start: one simply starts the database with a few different command-line arguments, but thereafter all interactions and all SQL statements work the same as in the single-node case. This is in stark contrast to ClickHouse. To build replicated tables, one needs to provide pretty substantially different configuration files; launch additional processes (ClickHouse Keeper); and, most importantly, the SQL itself is different. We started to handle this by asking the DB if a specific cluster exists, but I wonder if a more robust abstraction over single-node vs. clustered deployments would be useful, especially for testing purposes.

One could imagine a thin layer that takes the "single-node" SQL, plus a set of optional configuration parameters specifying the cluster details, and slightly rewrites the SQL to present either a single or distributed table that "looks" the same. This is part of the point behind the existing replicated table structure. In the single-node case, we create a […]. This all requires some tooling to easily manage the deployment of ClickHouse, though. We could imagine improving the existing […]. Anyway, this is a lot of random thoughts, but I'm unfortunately not in a position to spend the time disentangling them all to make a logical work stream. It's also not obvious to me that we need to do this now. We're limiting the amount of data in ClickHouse to 1 month, and until we have a bunch more metrics, it just doesn't take much space. The longer we wait, the harder this might get, but I think we need to make sure we have a solid plan for making the single- to multi-node transition work well, in a way we can reliably test.
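The "thin layer" idea above can be sketched very roughly. Everything here is hypothetical: the config struct, function names, and the generated DDL are illustrative only, and real ClickHouse replicated deployments need much more (Keeper/ZooKeeper paths, `Distributed` tables over the `_local` ones, and so on):

```rust
// Rough sketch of a layer that renders either single-node or replicated
// DDL from one logical table definition. Names and SQL are illustrative.

struct ClusterConfig {
    /// Name of the ClickHouse cluster, as defined in the server config.
    cluster_name: String,
}

/// Render CREATE TABLE DDL; the replicated form uses ON CLUSTER and a
/// ReplicatedMergeTree engine, while the single-node form uses MergeTree.
fn render_ddl(table: &str, columns: &str, cluster: Option<&ClusterConfig>) -> String {
    match cluster {
        None => format!(
            "CREATE TABLE {table} ({columns}) ENGINE = MergeTree() ORDER BY tuple()"
        ),
        Some(cfg) => format!(
            "CREATE TABLE {table}_local ON CLUSTER {cluster} ({columns}) \
             ENGINE = ReplicatedMergeTree() ORDER BY tuple()",
            cluster = cfg.cluster_name,
        ),
    }
}

fn main() {
    let single = render_ddl("measurements", "ts DateTime, value Float64", None);
    assert!(single.contains("ENGINE = MergeTree"));

    let cfg = ClusterConfig { cluster_name: "oximeter_cluster".into() };
    let replicated = render_ddl("measurements", "ts DateTime, value Float64", Some(&cfg));
    assert!(replicated.contains("ReplicatedMergeTree"));
    println!("{single}\n{replicated}");
}
```

The point of such a layer is exactly what's described above: callers write against one logical schema, and the single-node vs. clustered distinction is confined to one place that tests can exercise both ways.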