Problem: a customer confirmed that CD convergence is still slow for large node counts -- see #816. Solutions: in November 2025, after our last round of improvements, we noted two solution strategies to look into if it became necessary:
Any kind of sharding on a per-clique level may be super useful. And/or server-side applies.
We went ahead with server-side applies (SSA) (in #822). Subsequently, we measured that SSA-based convergence actually performs and scales worse than the pre-SSA convergence method. We then started to explore per-clique sharding (in #826). Initial measurements suggest that it yields the desired scaling behavior. A selection of measurement results is shown in the plot(s) below.
Convergence time over node count for different convergence methods
Main conclusions:
The gray data points correspond to the pre-SSA method. They lie on roughly a straight line in the log-log plot, which indicates power-law growth (convergence time proportional to N**k for some exponent k), not exponential growth.
The dashed lines represent SSA / SSA-with-fixes; they show that SSA performs worse than the pre-SSA method for more than just a handful of nodes (we tested the SSA patch, "CD daemon: use SSA for conflict-free nodes list updates" #822, with just four nodes before merging).
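A straight line in a log-log plot corresponds to a relation of the form t = c * N**k, and the slope of the line is the exponent k. The following is a minimal sketch of how such an exponent could be estimated with a least-squares fit on the logarithms. The function name `fit_power_law` and the input data are hypothetical; the numbers are synthetic and are not measurements from this PR.

```python
import math


def fit_power_law(node_counts, times):
    """Fit t = c * N**k by least squares on (log N, log t).

    Returns (k, c). Hypothetical helper for illustration only.
    """
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line in log-log space is the exponent k.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    # Intercept in log space is log(c).
    c = math.exp(mean_y - k * mean_x)
    return k, c


# Synthetic data (NOT real measurements): exactly t = 0.01 * N**2.
nodes = [4, 16, 64, 256, 1024]
times = [0.01 * n ** 2 for n in nodes]

k, c = fit_power_law(nodes, times)
print(f"exponent k = {k:.2f}, prefactor c = {c:.4f}")
```

Comparing the fitted exponent across convergence methods would put a number on "scales worse" or "scales better", with the caveat that a fit over a handful of single-shot data points is only a rough indication.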
Caveats:
We still need to measure larger N -- towards O(10**4). The per-clique sharding technique effectively relies on being able to make thousands of independent write requests to the API server per second. The API server seemingly can do it, but depending on how exactly it is deployed and on the other workload in the cluster, there will be a natural point of contention.
Each data point above corresponds to a single measurement. There is of course variance across repetitions, which we did not thoroughly measure. The main conclusions with respect to the scaling behavior of the different methods are likely to be robust. In the future, we'll measure variance through repetitions. "Eine Messung ist keine Messung" ("one measurement is no measurement").
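As a sketch of what measuring variance through repetitions could look like, the snippet below summarizes repeated convergence-time measurements per node count by mean, sample standard deviation, and coefficient of variation. The function name `summarize` and all numbers are hypothetical placeholders, not results from this PR.

```python
import statistics


def summarize(repetitions):
    """Return (mean, sample stdev, coefficient of variation).

    Needs at least two repetitions; hypothetical helper for illustration.
    """
    mean = statistics.mean(repetitions)
    stdev = statistics.stdev(repetitions)
    return mean, stdev, stdev / mean


# Made-up convergence times in seconds, keyed by node count (NOT real data).
measurements = {
    64: [1.0, 2.0, 3.0],
    256: [10.0, 12.0, 14.0],
}

for node_count, reps in sorted(measurements.items()):
    mean, stdev, cv = summarize(reps)
    print(f"N={node_count}: mean={mean:.2f}s stdev={stdev:.2f}s cv={cv:.2f}")
```

Plotting the mean with error bars of one standard deviation would show whether the differences between methods exceed the run-to-run spread.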
Appendix: same plot, using linear scales instead of a log-log representation (click to enlarge):
