
<fix>[core]: synchronize consistent hash ring to prevent dual-MN race condition #3332

Open
MatheMatrix wants to merge 1 commit into 5.5.6 from sync/ye.zou/fix/ZSTAC-77711

Conversation

@MatheMatrix
Owner

Summary

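  • ZSTAC-77711: in a dual-MN deployment the consistent hash ring becomes inconsistent, messages are routed to the wrong MN, and UI tasks stall
  • Root cause: nodeJoin/nodeLeft/iJoin, makeDestination, and related methods are not synchronized, so the heartbeat thread and the event thread modify nodeHash and nodes concurrently
  • Fix: every method that reads or writes nodeHash/nodes now takes a synchronized lock, and getManagementNodesInHashRing/getAllNodeInfo return defensive copies
  • Also fixes the nodes.put() return-value bug in getNodeInfo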
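
The locking and defensive-copy approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from ResourceDestinationMakerImpl: the class name, the TreeMap-based ring, the String payload, and the hash function are all assumptions made for the example; only the method names are taken from the summary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the real class; the data structures are illustrative only.
class SynchronizedRingSketch {
    // Plays the role of nodeHash: hash position -> node UUID.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    // Plays the role of nodes: node UUID -> node info (a plain String here).
    private final Map<String, String> nodes = new HashMap<>();

    // Simplified non-negative hash; a real ring would use a stronger hash and virtual nodes.
    private static int hash(String key) {
        return key.hashCode() & 0x7fffffff;
    }

    // Both structures are updated under the same monitor, so no reader sees one without the other.
    public synchronized void nodeJoin(String uuid, String info) {
        ring.put(hash(uuid), uuid);
        nodes.put(uuid, info);
    }

    public synchronized void nodeLeft(String uuid) {
        ring.remove(hash(uuid), uuid);
        nodes.remove(uuid);
    }

    // Picks the first node clockwise from the resource's hash, wrapping around at the end.
    public synchronized String makeDestination(String resourceUuid) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no management node in the hash ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(resourceUuid));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Returns a copy, so callers cannot mutate or iterate over the live ring.
    public synchronized List<String> getManagementNodesInHashRing() {
        return new ArrayList<>(ring.values());
    }

    public static void main(String[] args) {
        SynchronizedRingSketch sketch = new SynchronizedRingSketch();
        sketch.nodeJoin("mn-1", "info-1");
        sketch.nodeJoin("mn-2", "info-2");
        List<String> snapshot = sketch.getManagementNodesInHashRing();
        snapshot.clear(); // only the copy is cleared
        System.out.println(sketch.getManagementNodesInHashRing().size());
    }
}
```

Because every method synchronizes on the same object, a heartbeat thread calling nodeLeft and an event thread calling nodeJoin cannot interleave their updates to the two structures.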

Files Changed

  • ResourceDestinationMakerImpl.java — synchronized lock on all methods

Resolves: ZSTAC-77711

sync from gitlab !9154

@coderabbitai

coderabbitai bot commented Feb 12, 2026

Walkthrough

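Access to the management-node hash ring and node info is now protected by synchronization: several methods in ResourceDestinationMakerImpl are made synchronized, and Portal's ManagementNodeManagerImpl adds lifecycleLock and suspectedMissingFromDb and adjusts how heartbeat and lifecycle events are reconciled and how missing nodes are confirmed.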

Changes

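Cohort / File(s) | Summary
Core: synchronized protection for resource destinations and node info
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java
Makes the methods that touch the node hash ring and node map (nodeJoin/nodeLeft/iAmDead/iJoin/makeDestination/isManagedByUs/getManagementNodesInHashRing/getNodeInfo/getAllNodeInfo/getManagementNodeCount/isNodeInCircle) synchronized, and returns copies of collections or objects to prevent external modification. Review the impact of synchronization on concurrent performance and the risk of deadlock.
Lifecycle and heartbeat reconciliation changes
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
Adds lifecycleLock to serialize lifecycle events with heartbeat reconciliation, and adds suspectedMissingFromDb to track nodes suspected of being missing from the DB; health reconciliation now removes a node from the hash ring only after two rounds of confirmation, and nodeJoin clears the node from the suspected list. Review how this serialization affects event latency and how potential races are handled.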

Sequence Diagram(s)

sequenceDiagram
    participant HeartbeatReconciler as HeartbeatReconciler
    participant Manager as ManagementNodeManagerImpl
    participant DB as Database
    participant HashRing as NodeHashRing

    HeartbeatReconciler->>Manager: start reconciliation()
    alt acquire lifecycleLock
        Manager->>Manager: synchronized(lifecycleLock)
        Manager->>HashRing: list nodes in hash ring
        loop for each node
            Manager->>DB: query node by uuid
            alt DB has node
                DB-->>Manager: node exists
                Manager->>HashRing: ensure node present
                Manager->>Manager: suspectedMissingFromDb.remove(node)
            else DB missing node
                DB-->>Manager: not found
                Manager->>Manager: mark node in suspectedMissingFromDb (first round)
                alt second consecutive round (still missing)
                    Manager->>HashRing: remove node from hash ring
                    Manager->>Manager: suspectedMissingFromDb.remove(node)
                end
            end
        end
    end
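
The two-round confirmation in the diagram can be sketched as below. This is a simplified illustration under stated assumptions: the class name, the Set-based ring, and passing the DB contents in as a parameter are hypothetical; only the names lifecycleLock, suspectedMissingFromDb, and nodeJoin come from the walkthrough.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the real ManagementNodeManagerImpl queries the DB itself.
class TwoRoundReconcilerSketch {
    private final Object lifecycleLock = new Object();
    private final Set<String> ring = new HashSet<>();
    private final Set<String> suspectedMissingFromDb = new HashSet<>();

    // Lifecycle callback: a join always clears any earlier suspicion.
    void nodeJoin(String uuid) {
        synchronized (lifecycleLock) {
            ring.add(uuid);
            suspectedMissingFromDb.remove(uuid);
        }
    }

    // One heartbeat reconciliation round; uuidsInDb stands in for the DB query result.
    void reconcile(Set<String> uuidsInDb) {
        synchronized (lifecycleLock) {
            // Iterate over a copy because the ring may shrink inside the loop.
            for (String uuid : new HashSet<>(ring)) {
                if (uuidsInDb.contains(uuid)) {
                    suspectedMissingFromDb.remove(uuid);
                } else if (!suspectedMissingFromDb.add(uuid)) {
                    // add() returned false: already suspected last round, so remove now.
                    ring.remove(uuid);
                    suspectedMissingFromDb.remove(uuid);
                }
                // Otherwise this is the first miss: the node is only marked as suspected.
            }
        }
    }

    boolean inRing(String uuid) {
        synchronized (lifecycleLock) {
            return ring.contains(uuid);
        }
    }

    public static void main(String[] args) {
        TwoRoundReconcilerSketch sketch = new TwoRoundReconcilerSketch();
        sketch.nodeJoin("mn-1");
        sketch.reconcile(new HashSet<>());         // first miss: still in the ring
        System.out.println(sketch.inRing("mn-1"));
        sketch.reconcile(new HashSet<>());         // second consecutive miss: removed
        System.out.println(sketch.inRing("mn-1"));
    }
}
```

A node that reappears in the DB, or rejoins through nodeJoin, between two rounds is cleared from the suspected set, so it takes two consecutive misses to evict it.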

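Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 A new lock taps lightly, and steps on the ring fall steady,
I sit in the grass counting the heartbeats of nodes,
after two rounds of confirmation the storm passes and each returns to its place,
synchronized and guarded, the world grows quietly calm. 🥕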


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)
Check name Status Explanation Resolution
Title check ❌ Error PR title exceeds the 72-character limit at 79 characters, violating the formatting requirements. Shorten the title to 72 characters or less while maintaining clarity, e.g., '[core]: Synchronize hash ring to prevent dual-MN race condition'.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (6 files):

⚔️ build/pom.xml (content)
⚔️ core/pom.xml (content)
⚔️ core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (content)
⚔️ identity/src/main/java/org/zstack/identity/QuotaUtil.java (content)
⚔️ network/src/main/java/org/zstack/network/service/DhcpExtension.java (content)
⚔️ portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java (content)

These conflicts must be resolved before merging into 5.5.6.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The description is clearly related to the changeset, providing issue context, root cause analysis, and implementation details matching the code changes.

…talling

In dual management node scenarios, concurrent modifications to the
consistent hash ring from heartbeat reconciliation and canonical event
callbacks can cause NodeHash/Nodes inconsistency, leading to message
routing failures and task timeouts.

Fix: (1) synchronized all ResourceDestinationMakerImpl methods to
ensure atomic nodeHash+nodes updates, (2) added lifecycleLock in
ManagementNodeManagerImpl to serialize heartbeat reconciliation with
event callbacks, (3) added two-round delayed confirmation before
removing nodes from hash ring to avoid race with NodeJoin events.

Resolves: ZSTAC-77711

Change-Id: I3d33d53595dd302784dff17417a5b25f2d0f3426
@MatheMatrix MatheMatrix force-pushed the sync/ye.zou/fix/ZSTAC-77711 branch from e8732a5 to 312bd83 Compare February 15, 2026 17:13

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (1)

80-93: ⚠️ Potential issue | 🔴 Critical

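Fix the null return in getNodeInfo caused by the return value of put.

Map.put returns the previous value, so the current code sets info to null and returns it directly, which is a functional bug. Create the NodeInfo first, then put it, and return the new object.

🔧 Suggested fix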
         if (info == null) {
             ManagementNodeVO vo = dbf.findByUuid(nodeUuid, ManagementNodeVO.class);
             if (vo == null) {
                 throw new ManagementNodeNotFoundException(nodeUuid);
             }

-            nodeHash.add(nodeUuid);
-            info = nodes.put(nodeUuid, new NodeInfo(vo));
+            info = new NodeInfo(vo);
+            nodeHash.add(nodeUuid);
+            nodes.put(nodeUuid, info);
         }

         return info;
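
The behavior behind this suggestion can be reproduced in isolation. The sketch below is a hypothetical reduction that uses plain String values instead of NodeInfo and omits the DB lookup; it shows only that java.util.Map.put returns the value previously associated with the key, or null if there was none.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reduction of the getNodeInfo pattern; NodeInfo is replaced by String.
class MapPutReturnSketch {
    // Mirrors the buggy pattern: on a cache miss, put returns the previous value (null).
    static String buggyLookup(Map<String, String> nodes, String uuid) {
        String info = nodes.get(uuid);
        if (info == null) {
            info = nodes.put(uuid, "info-" + uuid);
        }
        return info;
    }

    // Mirrors the suggested fix: build the value first, store it, then return it.
    static String fixedLookup(Map<String, String> nodes, String uuid) {
        String info = nodes.get(uuid);
        if (info == null) {
            info = "info-" + uuid;
            nodes.put(uuid, info);
        }
        return info;
    }

    public static void main(String[] args) {
        System.out.println(buggyLookup(new HashMap<>(), "mn-1")); // prints "null"
        System.out.println(fixedLookup(new HashMap<>(), "mn-1")); // prints "info-mn-1"
    }
}
```

The buggy variant fails only on the first lookup of a node: a second call finds the entry stored by the first and returns it, which can make the defect intermittent and hard to spot.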
