Skip to content

ALTER DEFAULT PRIVILEGES: intermittent error about primary key constraint violation #90803

Closed
@davepacheco

Description

@davepacheco

Describe the problem

I'm not sure how we got our database into the following state. It's not reproducible for us from a clean slate (i.e., as our software deploys the database), but it's somewhat reproducible through some of our automation.

Here's the problem: we try to run ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;. Sometimes it works, and sometimes it produces a confusing error message. Here's a sequence running the command one right after another (just a few tens of milliseconds between queries):

root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(1441793) already exists.
CONSTRAINT: primary
root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ALTER DEFAULT PRIVILEGES


Time: 47ms total (execution 47ms / network 0ms)

root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(1507329) already exists.
CONSTRAINT: primary
root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ALTER DEFAULT PRIVILEGES


Time: 54ms total (execution 53ms / network 0ms)

root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(1572865) already exists.
CONSTRAINT: primary
root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ALTER DEFAULT PRIVILEGES


Time: 39ms total (execution 39ms / network 0ms)

root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ALTER DEFAULT PRIVILEGES


Time: 49ms total (execution 49ms / network 0ms)

root@[fd00:1122:3344:101::2]:32221/defaultdb> ALTER DEFAULT PRIVILEGES FOR ROLE root REVOKE ALL ON TABLES FROM omicron;
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(1671169) already exists.
CONSTRAINT: primary
root@[fd00:1122:3344:101::2]:32221/defaultdb> 

First, it's not clear what this error message means in the context of this query. It seems like something internal bubbling out? But it's also not clear why it only happens sometimes, when we haven't changed the state of the system. I believe that while I'm doing this, there are no other connected sessions.

On a whim: I restarted CockroachDB from this state and now every attempt to run that SQL results in the constraint violation error. After a few more minutes (writing up this report), I tried again, and we're back to an intermittent result.

To Reproduce

It's hard to get you the sequence of steps from a fresh cluster because this is the result of a bunch of automation. But once we're here, just running that SQL query reproduces the problem.

Expected behavior

We expected that query to complete successfully, or maybe fail with an error message that has something to do with what's wrong with it.

Additional data / screenshots

Let me know what other information would be useful.

Environment:

This is a single-node CockroachDB cluster.

Build Tag:        v21.2.9
Build Time:       2022/04/28 04:02:42
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.16.10
C Compiler:       gcc 10.3.0
Build Commit ID:  11787edfcfc157a0df951abc34684e4e18b3ef20
Build Type:       release

Note we're running an OSS build on Helios (our illumos distro).

Additional context

The context here is that we have some automation that sets up a single-node cluster and populates it with this schema that defines one database and one user:
https://github.com/oxidecomputer/omicron/blob/main/common/src/sql/dbinit.sql

We also have a mechanism that's intended to wipe the database to a pristine state with:
https://github.com/oxidecomputer/omicron/blob/main/common/src/sql/dbwipe.sql

It's that second step that's failing in the way reported here. We stopped the automation at that point, logged in interactively, and found the behavior described above.

The impact is that our automation fails to reliably wipe the database in this case. This is not recoverable in an automatable way, at least not without knowing what we should do if we find this situation.

Jira issue: CRDB-20949

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.O-communityOriginated from the communityT-sql-foundationsSQL Foundations Team (formerly SQL Schema + SQL Sessions)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions