[NEW] Server driven slot migration

Continuation of https://github.com/valkey-io/valkey/pull/245#discussion_r1588966176

Today, slot migration is completely driven by an external process, essentially executing the steps below:

1. On the destination node
`CLUSTER SETSLOT <slot> IMPORTING <source_node_id>`

2. On the source node
`CLUSTER SETSLOT <slot> MIGRATING <destination_node_id>`

3. Get keys and migrate them one by one
`CLUSTER GETKEYSINSLOT <slot> <count>`
`MIGRATE <destination_ip> <destination_port> <key> 0 <timeout>`

4. Set the slot to the destination node on all nodes
`CLUSTER SETSLOT <slot> NODE <destination_node_id>`
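For illustration, the four steps above can be sketched as a small driver. This is a minimal sketch, not an existing tool: `send` is a hypothetical transport callable (node, command arguments) that a real script would back with a client connection, and the parameter names, batch size, and timeout are assumptions. It deliberately has no error handling, which is exactly where the real process gets fragile.

```python
from typing import Callable, Sequence

# Hypothetical transport: send(node, args) runs one command on the named
# node and returns its reply. Not part of any real client library.
Send = Callable[[str, Sequence[str]], object]


def migrate_slot(send: Send, slot: int, source_id: str, dest_id: str,
                 dest_host: str, dest_port: int, all_nodes: Sequence[str],
                 batch: int = 100, timeout_ms: int = 5000) -> int:
    """Drive one slot migration externally; returns the number of keys moved."""
    slot_s = str(slot)

    # Step 1: mark the slot as importing on the destination.
    send(dest_id, ["CLUSTER", "SETSLOT", slot_s, "IMPORTING", source_id])
    # Step 2: mark the slot as migrating on the source.
    send(source_id, ["CLUSTER", "SETSLOT", slot_s, "MIGRATING", dest_id])

    # Step 3: fetch keys in batches and move them one by one.
    # Assumes each MIGRATE succeeds and removes the key from the source;
    # a failure here would need retry or rollback logic that is omitted.
    moved = 0
    while True:
        keys = send(source_id, ["CLUSTER", "GETKEYSINSLOT", slot_s, str(batch)])
        if not keys:
            break
        for key in keys:
            send(source_id, ["MIGRATE", dest_host, str(dest_port), key,
                             "0", str(timeout_ms)])
            moved += 1

    # Step 4: assign the slot to the destination on every node.
    for node in all_nodes:
        send(node, ["CLUSTER", "SETSLOT", slot_s, "NODE", dest_id])
    return moved
```

Every `send` call in this loop is a separate point where the external process can crash or lose its connection, leaving the slot in a half-migrated state.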

This is a heavy-handed process with many failure paths to handle. Even with the improvements introduced in #445, step 3 above is still error-prone.

The proposal here is to introduce a new command that allows the entire process to be executed on the migration source node in one shot. We can relatively easily perform all the steps above in the engine for now, but going forward, this change also serves as a stepping stone to the eventual atomic slot migration (#23).

On a high level, here is what the proposed workflow would look like:

1. Initiate slot migration
`CLUSTER MIGRATE QUEUE <SLOTS> <SHARD_ID>`, where `<SLOTS>` is a comma-separated, unordered list of slot ranges or single slots, such as `3-6,7,10,1`. Note that the target is identified by `<SHARD_ID>` rather than `<NODE_ID>`. This relieves the client of the hassle of tracking down the shard's primary node, which is volatile state in its own right and can change right after the client queries it.

This command is also non-blocking, like `CLUSTER FAILOVER`.
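To make the `<SLOTS>` format concrete, here is a minimal parsing sketch. The function name `parse_slots` is hypothetical, and the handling of duplicates and overlapping ranges (silently merged here) is an assumption, since the proposal does not specify it. The upper bound reflects the 16384 hash slots of a cluster.

```python
from typing import List

NUM_SLOTS = 16384  # cluster hash slots are numbered 0..16383


def parse_slots(spec: str) -> List[int]:
    """Expand a spec such as '3-6,7,10,1' into a sorted list of slot numbers."""
    slots = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo_s, hi_s = part.split("-", 1)
            lo, hi = int(lo_s), int(hi_s)  # int("") raises ValueError
        else:
            lo = hi = int(part)
        if not (0 <= lo <= hi < NUM_SLOTS):
            raise ValueError(f"invalid slot or range: {part!r}")
        # Assumption: overlapping or repeated entries are merged, not rejected.
        slots.update(range(lo, hi + 1))
    return sorted(slots)
```

For the example in the text, `parse_slots("3-6,7,10,1")` yields slots 1, 3, 4, 5, 6, 7, and 10.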

2. Check slot migration results

Whether the slots were migrated successfully can be determined with any of the cluster topology query commands. However, regardless of how the slot migration is performed (atomically or not), errors can happen, and the client needs a way to get more information about an incomplete migration. The detailed implementation is not a concern at this point, but the user interface is key. Because the client will also need to cancel in-progress or pending slot migrations, results should be reported per slot. For this reason, we could consider a command like the one below:

`CLUSTER MIGRATE REPORT <SLOTS>`, where `<SLOTS>` is optional. When `<SLOTS>` is omitted, this command reports all slots with an in-progress or pending migration.

The report is an array with each element being a map with one of the following two sets of fields:

a. on source
`slot_number, target_shard_id, state (started/pending/failed), num_retries, queued_time, start_time, update_time`

b. on target
`slot_number, source_shard_id, start_time, update_time`
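As a sketch of how a client might model the two report shapes, the field names below are taken from the lists above, while everything else is an assumption: the class names, the integer type for the time fields, and the rule of telling source entries from target entries by the presence of `target_shard_id`.

```python
from dataclasses import dataclass
from typing import Mapping, Union

# States listed in the proposal for source-side entries.
VALID_STATES = {"started", "pending", "failed"}


@dataclass(frozen=True)
class SourceSlotReport:
    slot_number: int
    target_shard_id: str
    state: str
    num_retries: int
    queued_time: int   # assumption: timestamps arrive as integers
    start_time: int
    update_time: int


@dataclass(frozen=True)
class TargetSlotReport:
    slot_number: int
    source_shard_id: str
    start_time: int
    update_time: int


def parse_report_entry(
    entry: Mapping[str, object],
) -> Union[SourceSlotReport, TargetSlotReport]:
    """Turn one map from the report array into a typed record."""
    # Assumption: only source-side entries carry target_shard_id.
    if "target_shard_id" in entry:
        report = SourceSlotReport(**entry)
        if report.state not in VALID_STATES:
            raise ValueError(f"unknown state: {report.state!r}")
        return report
    return TargetSlotReport(**entry)
```

Keeping the two shapes as separate types makes it explicit that only the source side tracks state, retries, and queueing time.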

3. Cancel an in-progress or pending slot migration

`CLUSTER MIGRATE CANCEL <SLOTS>`
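Put together, a client-side flow over the three proposed subcommands might look like the sketch below. None of this exists yet: `send` is a hypothetical callable that runs a command on the source node, the reply is assumed to be a list of maps, and treating an empty report as "nothing left in progress" is an assumption the proposal does not make. A real caller would also sleep between polls and confirm the final ownership with a topology query.

```python
from typing import Callable, Sequence

# Hypothetical transport: send(args) runs one command on the source node.
Send = Callable[[Sequence[str]], object]


def migrate_or_cancel(send: Send, slots: str, shard_id: str,
                      max_polls: int = 10) -> str:
    """Queue a migration, poll its report, and cancel on failure or timeout."""
    # Non-blocking: the command returns before the migration finishes.
    send(["CLUSTER", "MIGRATE", "QUEUE", slots, shard_id])

    for _ in range(max_polls):
        report = send(["CLUSTER", "MIGRATE", "REPORT", slots])
        # Assumption: an empty report means no slot is pending or in progress.
        if not report:
            return "done"
        if any(entry.get("state") == "failed" for entry in report):
            break  # stop waiting and cancel the remaining slots

    send(["CLUSTER", "MIGRATE", "CANCEL", slots])
    return "cancelled"
```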

Note that this proposal allows the future atomic slot migration improvement to be introduced as a drop-in replacement of the existing migration scheme.

[NEW] Server driven slot migration #587
