Commit 39b83b5

[CI] deepseek_v4_flash: run the full stack — triton indexer, compile, bs128
Switch the V4-Flash reference config from the 4-layer smoke setup to the full run: drop the ``num_hidden_layers = 4`` cap (use the release's 43 layers), select the fused Triton indexer top-k backend (``indexer_backend = "triton"``), turn ``compile_cfg`` on, and raise ``global_batch_size`` 16 -> 128.
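For a sense of scale, the batch-size change can be turned into a rough token count per optimizer step. The sketch below assumes every sample in the global batch is one pack of at most `pack_max_length` tokens, and takes 8192 from the "pack=8192" remark in the config comments. Neither assumption is confirmed by this commit, and `tokens_per_step` is a helper name invented here for illustration.

```python
def tokens_per_step(global_batch_size: int, pack_max_length: int) -> int:
    """Upper bound on tokens per optimizer step, assuming each sample is a full pack."""
    return global_batch_size * pack_max_length


# Assumption: pack_max_length is 8192, as suggested by the "pack=8192" config comment.
PACK_MAX_LENGTH = 8192

before = tokens_per_step(16, PACK_MAX_LENGTH)   # smoke setup
after = tokens_per_step(128, PACK_MAX_LENGTH)   # full run

print(f"before: {before} tokens/step")
print(f"after:  {after} tokens/step ({after // before}x)")
```

Under these assumptions the full run processes up to 8x the tokens per step, which is worth keeping in mind when comparing loss curves or step times against the earlier smoke runs.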
1 parent 874bc3e commit 39b83b5

1 file changed

Lines changed: 5 additions & 4 deletions

ci/config/deepseek_v4_flash.py

@@ -35,7 +35,7 @@
 # fields (num_hash_layers, swiglu_limit, attn_sink dims) are picked up from the
 # checkpoint instead of relying on the Config defaults.
 moe_cfg = DeepSeekV4Config.from_hf(DEEPSEEK_V4_PATH)
-moe_cfg.num_hidden_layers = 4
+# moe_cfg.num_hidden_layers = 4
 # V4 MTP forward is not wired yet (DeepSeekV4.build_mtp_block returns None), but
 # from_hf sets mtp_config from the release's num_nextn_predict_layers=1. Left as-is,
 # MoE.build_loss_ctx_batch keys off `mtp_config is not None` and builds MTP loss
@@ -90,16 +90,17 @@
 # (slower; see DSA._resolve_sparse_attn_fn). Must match the DataloaderConfig
 # pack_max_length below.
 moe_cfg.attention.pack_max_length = pack_max_length
+moe_cfg.attention.indexer_backend = "triton"
 # Compile is now safe — cutlass group_gemm is annotated with @torch.library.custom_op
 # (compile-friendly), and HC + DSA helpers are pure-Tensor.
 # Temporarily disabled: under pack=8192 + intra_layer_micro_batch=1 +
 # recompute_ratio=1.0 some backward path allocates a 130 GiB fp32 tensor.
 # The 06:00 run with compile_cfg=False reached step 50 at max_mem 114 GB so
 # the baseline fits — debug what compile_cfg=True is changing in the eager
 # code path that adds 130 GB on top.
-moe_cfg.compile_cfg = False
+moe_cfg.compile_cfg = True
 
-optim_cfg = AdamWConfig(lr=6e-05)
+optim_cfg = AdamWConfig(lr=6e-05,)
 lr_cfg = LRConfig(lr_type="cosine", lr_min=1e-6)
 fsdp_cfg = FSDPConfig(
     # `FSDPConfig.torch_compile` is deprecated (1.1.0) and now acts as a master
@@ -150,7 +151,7 @@
     lr_cfg=lr_cfg,
     loss_cfg=loss_cfg,
     tokenizer_path=DEEPSEEK_V4_PATH,
-    global_batch_size=16,
+    global_batch_size=128,
     work_dir="/mnt/shared-storage-user/yehaochen/tmp",
     seed=0,
     strict_load=False,
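The diff switches between the smoke and full setups by commenting lines in and out. One possible alternative, shown here only as a hedged sketch, is to group the four settings this commit touches into named profiles. `RunProfile`, `PROFILES`, `select_profile`, `apply_profile`, and the `V4_FLASH_PROFILE` variable are all hypothetical names that do not exist in the repository. A `None` field means "leave whatever the checkpoint or config default provides", and the stand-in config object only mimics the attribute names seen in the diff.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunProfile:
    """Hypothetical bundle of the four settings changed by this commit."""
    num_hidden_layers: Optional[int]   # None: keep the value loaded from the checkpoint
    indexer_backend: Optional[str]     # None: keep the config default
    compile_cfg: bool
    global_batch_size: int


PROFILES = {
    # Values before this commit (4-layer smoke setup).
    "smoke": RunProfile(4, None, False, 16),
    # Values after this commit (full run).
    "full": RunProfile(None, "triton", True, 128),
}


def select_profile(env: Mapping[str, str]) -> RunProfile:
    """Pick a profile by name from a mapping such as os.environ; default to 'full'."""
    name = env.get("V4_FLASH_PROFILE", "full")
    if name not in PROFILES:
        raise ValueError(f"unknown profile {name!r}; expected one of {sorted(PROFILES)}")
    return PROFILES[name]


def apply_profile(moe_cfg, profile: RunProfile) -> None:
    """Write the profile onto a config object with the attribute names used in the diff."""
    if profile.num_hidden_layers is not None:
        moe_cfg.num_hidden_layers = profile.num_hidden_layers
    if profile.indexer_backend is not None:
        moe_cfg.attention.indexer_backend = profile.indexer_backend
    moe_cfg.compile_cfg = profile.compile_cfg


# Stand-in for the real config object, for demonstration only.
demo_cfg = SimpleNamespace(
    num_hidden_layers=43,
    compile_cfg=False,
    attention=SimpleNamespace(indexer_backend="default"),
)
apply_profile(demo_cfg, select_profile({"V4_FLASH_PROFILE": "smoke"}))
print(demo_cfg.num_hidden_layers, demo_cfg.compile_cfg)
```

In a real config file the mapping passed to `select_profile` would typically be `os.environ`, and `global_batch_size` would be read from the selected profile when building the trainer config.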
