add cookbook for Deepseek v4 pro PD disagg #23777
yan-lgtm wants to merge 1 commit into sgl-project:main
Code Review
This pull request adds documentation and configuration for DeepSeek-V4 PD-Disagg on B200 clusters, including RDMA setup guides and memory optimizations to prevent OOM on 'big' models. Feedback suggests improving the RDMA troubleshooting script by iterating over all InfiniBand devices instead of hardcoding a single device name.
for i in $(seq 0 7); do
  v=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/$i 2>/dev/null)
  [ -n "$v" ] && [ "$v" != "0x0000" ] && echo "pkey[$i]=$v"
done
The troubleshooting script hardcodes mlx5_0 when checking the pkey table. However, the preceding text mentions that mlx5_7 is the typical default for B200, and the previous script iterates over all mlx5_* devices. Hardcoding mlx5_0 here might lead users to inspect the wrong device. It is better to iterate over all devices to ensure the active one is covered.
for d in /sys/class/infiniband/mlx5_*; do
  echo "--- $(basename "$d") ---"
  for i in $(seq 0 7); do
    v=$(cat "$d/ports/1/pkeys/$i" 2>/dev/null)
    [ -n "$v" ] && [ "$v" != "0x0000" ] && echo "pkey[$i]=$v"
  done
done
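The same idea can be taken one step further so that neither the device name nor the port number is hardcoded. The following is a minimal sketch, assuming the standard sysfs layout `<root>/<device>/ports/<port>/pkeys/<index>`; the function name `list_pkeys` and its optional root-directory argument are illustrative and not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: print every non-empty pkey entry for every port of every mlx5_* device.
# list_pkeys and its optional argument are hypothetical helpers, not part of the PR.
list_pkeys() {
    # Root of the InfiniBand sysfs tree; can be overridden, e.g. for a dry run.
    local root="${1:-/sys/class/infiniband}"
    local dev port entry v
    for dev in "$root"/mlx5_*; do
        [ -d "$dev" ] || continue              # glob matched nothing
        for port in "$dev"/ports/*; do
            [ -d "$port/pkeys" ] || continue   # port has no pkey table
            for entry in "$port"/pkeys/*; do
                [ -r "$entry" ] || continue
                v=$(cat "$entry" 2>/dev/null)
                # Skip empty slots, which read back as 0x0000.
                if [ -n "$v" ] && [ "$v" != "0x0000" ]; then
                    echo "$(basename "$dev") port=$(basename "$port") pkey[$(basename "$entry")]=$v"
                fi
            done
        done
    done
}
```

Calling `list_pkeys` with no argument reads `/sys/class/infiniband`. Each output line names its device and port, so entries from `mlx5_0` and `mlx5_7` cannot be confused.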