Some question about heartbeatConfVersion #384

@wangqiim

https://github.com/tidb-incubator/tinykv/blob/f050d8c1bde1dd210fdcd6e0dbb0739af5687669/kv/test_raftstore/scheduler.go#L335-L372

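While implementing project3b, I sometimes hit a panic at line 371 of the code above. The log leading up to the panic is shown below (two peers are added and then the same two are removed).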

2022/03/30 09:44:40.848845 /home/wangqi/workplace/tinykv/kv/test_raftstore/peer_msg_handler.go:109: [info] [wq] Node [region 1] 7, add: 8 Region.ConfVer: 15
2022/03/30 09:44:40.848958 /home/wangqi/workplace/tinykv/kv/test_raftstore/peer_msg_handler.go:109: [info] [wq] Node [region 1] 7, add: 9 Region.ConfVer: 16
2022/03/30 09:44:40.849307 /home/wangqi/workplace/tinykv/kv/test_raftstore/peer_msg_handler.go:124: [info] [wq] Node [region 1] 7, remove: 8 Region.ConfVer: 17
2022/03/30 09:44:40.849519 /home/wangqi/workplace/tinykv/kv/test_raftstore/peer_msg_handler.go:124: [info] [wq] Node [region 1] 7, remove: 9 Region.ConfVer: 18

The cause of the panic: the peer membership is the same at ConfVer 15 and at ConfVer 19, but the scheduler sees the region's ConfVer jump by more than 1 and panics.
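
To make the failing condition concrete, here is a minimal sketch of the kind of check described above. It is a simplified model based only on the behavior reported in this issue, not a copy of the code in `scheduler.go`. The type `regionView`, the function `checkConfVerStep`, and the peer IDs are all hypothetical.

```go
package main

import "fmt"

// regionView is a hypothetical, stripped-down stand-in for the region
// metadata carried by a heartbeat. It is not the real metapb.Region type.
type regionView struct {
	ConfVer uint64
	Peers   []uint64 // peer IDs (illustrative values only)
}

// checkConfVerStep models the rule described in this issue: if a heartbeat
// carries a newer ConfVer but the same number of peers as the scheduler's
// record, ConfVer must have advanced by exactly 1. Other cases (stale
// heartbeats, differing peer counts) are not modeled in this sketch.
func checkConfVerStep(known, incoming regionView) error {
	if incoming.ConfVer <= known.ConfVer {
		return nil // stale or unchanged: nothing to verify here
	}
	samePeerCount := len(incoming.Peers) == len(known.Peers)
	if samePeerCount && incoming.ConfVer != known.ConfVer+1 {
		return fmt.Errorf("unmatched conf version: known %d, incoming %d",
			known.ConfVer, incoming.ConfVer)
	}
	return nil
}

func main() {
	// Scheduler last saw ConfVer 15. The next heartbeat reports ConfVer 19
	// with the same peer count, because add 8, add 9, remove 8, remove 9
	// were all applied between two heartbeats.
	known := regionView{ConfVer: 15, Peers: []uint64{5, 6, 7}}
	incoming := regionView{ConfVer: 19, Peers: []uint64{5, 6, 7}}
	fmt.Println(checkConfVerStep(known, incoming))
}
```

In this model the four conf changes cancel out in terms of membership, so the only visible difference is the ConfVer jump of 4, which is what the check rejects.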
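
I then tracked down where region heartbeats are triggered and found the following two places:
(1)https://github.com/tidb-incubator/tinykv/blob/course/kv/raftstore/peer_msg_handler.go#L511-L518
(2)https://github.com/tidb-incubator/tinykv/blob/course/kv/raftstore/peer_msg_handler.go#L202-L204
(1) is triggered by the timer tick.
(2) is triggered when a node being added via addnode (a pending node) catches up with the leader's truncate point.
Consider the following case: the leader and a majority of the other nodes are already in sync, so they can ignore the pending node and go ahead and apply the log entries (the addnode and removenode requests). At that version (2) cannot be triggered, so the region heartbeat only has a chance to fire when the timer in (1) times out. The scheduler may then observe the region's ConfVer jumping by more than 1 and panic.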
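
The timing argument above can be illustrated with a toy model. This is a sketch under stated assumptions, not tinykv code: `observedConfVers` is a hypothetical helper, and the `heartbeatPerChange` mode is included only for contrast, not as a claim about how the project expects this to be solved.

```go
package main

import "fmt"

// observedConfVers returns the sequence of ConfVer values a scheduler would
// see in a toy model. The scheduler has already seen `start`; then `changes`
// conf changes are applied back to back.
//   - heartbeatPerChange == false: only the next timer tick reports state,
//     so the scheduler sees just the final ConfVer.
//   - heartbeatPerChange == true: a heartbeat follows every applied change.
func observedConfVers(start uint64, changes int, heartbeatPerChange bool) []uint64 {
	seen := []uint64{start}
	confVer := start
	for i := 0; i < changes; i++ {
		confVer++ // each applied addnode/removenode bumps ConfVer by one
		if heartbeatPerChange {
			seen = append(seen, confVer)
		}
	}
	if !heartbeatPerChange && changes > 0 {
		seen = append(seen, confVer) // the tick reports only the latest state
	}
	return seen
}

func main() {
	// add 8, add 9, remove 8, remove 9 applied between two ticks.
	fmt.Println("tick only:      ", observedConfVers(15, 4, false))
	fmt.Println("after each step:", observedConfVers(15, 4, true))
}
```

With tick-only heartbeats the scheduler goes straight from 15 to 19, which matches the jump in the log above. With a heartbeat after every applied change it would see each intermediate version.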
