Skip to content

adapter: Add coordinator consistency check #21740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

andrewrodriguez-m
Copy link
Contributor

@andrewrodriguez-m andrewrodriguez-m commented Sep 13, 2023

Motivation

Piggybacking on #21717, adding consistency checks to the coordinator that get run in dev and during testdrive

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'm curious to see what happens in CI.

Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@def-
Copy link
Contributor

def- commented Sep 14, 2023

Whole bunch of failures and panics in nightly, seems related to this change: https://buildkite.com/materialize/nightlies/builds/4009

testdrive-materialized-1     | thread 'coordinator' panicked at 'assertion failed: `(left == right)`
testdrive-materialized-1     |   left: `Err(CoordinatorInconsistencies { catalog_inconsistencies: CatalogInconsistencies { internal_fields: [], roles: [], comments: [], object_dependencies: [] }, read_capabilities: [Compute(System(667)), Compute(System(668)), Compute(System(669)), Compute(System(670)), Compute(System(671)), Compute(System(672)), Compute(System(673)), Compute(System(674)), Compute(System(675)), Compute(System(676)), Compute(System(677)), Compute(System(678)), Compute(System(679)), Compute(System(680)), Compute(System(681)), Compute(System(682)), Compute(System(683)), Compute(System(684)), Compute(System(685)), Compute(System(686)), Compute(System(687)), Compute(System(688)), Compute(System(689)), Compute(System(690)), Compute(System(691)), Compute(System(692)), Compute(System(693)), Compute(System(694)), Compute(System(695)), Compute(System(696)), Compute(System(697)), Compute(System(698)), Compute(System(699)), Compute(System(700)), Compute(System(701)), Compute(System(702)), Compute(System(703)), Compute(System(704)), Compute(System(705)), Compute(System(706)), Compute(System(707)), Compute(System(708)), Compute(System(709)), Compute(System(710)), Compute(System(711)), Compute(System(712)), Compute(System(713)), Compute(System(714))] })`,
testdrive-materialized-1     |  right: `Ok(())`', /var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-061d917ef8cdf048c-1/materialize/tests/src/adapter/src/coord/ddl.rs:519:9

Copy link
Contributor

@ParkMyCar ParkMyCar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's not good that we found inconsistencies, it's great to know that this is working!

state.materialize_internal_http_addr,
))
.await?
.text()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to call .json() here which will parse the body as JSON

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for that. I was thinking about leaving it as is b/c i've added more context when each step can fail which has been helpful while debugging

def-
def- previously requested changes Sep 14, 2023
Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be submitted while it makes nightly that red.

Comment on lines 446 to 455
for id in self
.catalog()
.get_cluster(cluster_id)
.log_indexes
.values()
.cloned()
.collect_vec()
{
self.drop_compute_read_policy(&id);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes:

  1. I would move this to before we drop the cluster. It's possible that drop_cluster wipes out some metadata needed to update the read policies of the log indexes.
  2. You'll probably need to collect the log indexes before executing the catalog transaction. Since the transaction completed and this cluster was dropped, the cluster won't exist in the catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@shepherdlybot
Copy link

shepherdlybot bot commented Sep 26, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having this change reviewed, adequate tests should be considered and it may be useful to add observability and/or a feature flag. What's This?

Risk Score Probability Buggy File Hotspots
🔴 80 / 100 60% 3
Buggy File Hotspots:
File Percentile
../coord/command_handler.rs 95
../src/coord.rs 99
../src/catalog.rs 100

@andrewrodriguez-m
Copy link
Contributor Author

Shouldn't be submitted while it makes nightly that red.

@def- i think we're good now

@andrewrodriguez-m andrewrodriguez-m dismissed def-’s stale review September 27, 2023 16:58

addressed test failures

@andrewrodriguez-m andrewrodriguez-m merged commit 8568f9d into MaterializeInc:main Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants