
Commit bb398ee

JacoCheung, claude, and geoffreyQiu authored
fix: reduce Docker layers, add auto CI trigger, fix fake ops import (#363)
* fix: reduce Docker image layers to avoid overlay2 max depth limit Aggressively merge RUN instructions in the Dockerfile to reduce total layer count from ~126 to ~119. The inference image was hitting the overlay2 128-layer limit ("failed to register layer: max depth exceeded") on CI nodes. devel stage: 8 RUN + 1 COPY -> 4 RUN + 1 COPY (-4 layers) build stage: 4 RUN + 1 COPY -> 1 RUN + 1 COPY (-3 layers) FBGEMM and TorchRec kept as separate layers for build cache efficiency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: add pull_request_target trigger for auto CI on PR open/sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix imports for fake ops wrapper used in expor * fix: remove invalid import of hstu.hstu_ops_gpu The module hstu.hstu_ops_gpu does not exist as a Python module. The C++ source hstu_ops_gpu.cpp compiles into hstu/fbgemm_gpu_experimental_hstu.so, not a separate hstu_ops_gpu submodule. This import was incorrectly added in PR #327 and causes ModuleNotFoundError in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update FBGEMM submodule to include hstu_ops_gpu.py fake impl Update from 04df536 to 65bad42 which adds fake tensor implementations for torch.export (hstu_ops_gpu.py). This was missing since PR #340 accidentally reverted the submodule pointer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: allow /build with flags by matching prefix instead of exact string Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci: remove pull_request_target trigger, keep only /build comment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Junyi Qiu <junyiq@nvidia.com>
1 parent c7b9ea2 commit bb398ee

5 files changed

Lines changed: 56 additions & 60 deletions


.github/workflows/blossom-ci.yml

Lines changed: 3 additions & 2 deletions
@@ -33,9 +33,10 @@ jobs:
     outputs:
       args: ${{ env.args }}

-    # This job only runs for pull request comments
+    # This job only runs for /build comments
     if: |
-      github.event.comment.body == '/build' && contains(fromJson('["EmmaQiaoCh","JacoCheung","kanghui0204","jiashuy","shijieliu"]'), github.actor)
+      contains(fromJson('["EmmaQiaoCh","JacoCheung","kanghui0204","jiashuy","shijieliu"]'), github.actor) &&
+      startsWith(github.event.comment.body, '/build')
     steps:
       - name: Check if comment is issued by authorized person
         run: blossom-ci
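The effect of swapping the exact-equality check for a prefix match can be sketched in plain Python (a hypothetical stand-in for the workflow expression, not code from the repo):

```python
# Hypothetical sketch of the new trigger predicate: the job fires when an
# allowlisted actor posts a comment that *starts with* "/build", so comments
# carrying flags, e.g. "/build --stage inference", now match. The old
# condition compared the comment body for exact equality with "/build"
# and therefore rejected any trailing flags.
AUTHORIZED = {"EmmaQiaoCh", "JacoCheung", "kanghui0204", "jiashuy", "shijieliu"}

def should_trigger(actor: str, comment_body: str) -> bool:
    # Mirrors contains(fromJson('[...]'), github.actor) &&
    #         startsWith(github.event.comment.body, '/build')
    return actor in AUTHORIZED and comment_body.startswith("/build")

print(should_trigger("JacoCheung", "/build --stage inference"))  # True
print(should_trigger("JacoCheung", "/build"))                    # True
print(should_trigger("someone-else", "/build"))                  # False
```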

docker/Dockerfile

Lines changed: 35 additions & 52 deletions
@@ -8,61 +8,50 @@ ARG TRITONSERVER_BUILD

 WORKDIR /workspace/deps

+# -- Layer 1: system setup, arch symlinks, tritonserver deps ---
 RUN if [ "${TRITONSERVER_BUILD}" = "1" ]; then \
     ln /bin/python3 /bin/python && \
-    apt-get update -y --fix-missing && apt-get install -y cmake && apt-get install -y patchelf; \
-    fi
-
-RUN if [ "${TRITONSERVER_BUILD}" = "1" ]; then \
+    apt-get update -y --fix-missing && apt-get install -y cmake patchelf && \
     pip3 install pandas rich cloudpickle psutil && \
     pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130; \
-    fi
-
-RUN ARCH=$([ "${TARGETPLATFORM}" = "linux/arm64" ] && echo "aarch64" || echo "x86_64") && \
+    fi && \
+    ARCH=$([ "${TARGETPLATFORM}" = "linux/arm64" ] && echo "aarch64" || echo "x86_64") && \
     rm -rf /usr/lib/${ARCH}-linux-gnu/libnvidia-ml.so.1 && \
     if [ ${ARCH} = "aarch64" ]; then \
       ln -s /usr/local/cuda-13/targets/sbsa-linux/lib/stubs/libnvidia-ml.so /usr/lib/${ARCH}-linux-gnu/libnvidia-ml.so.1; \
     else \
       ln -s /usr/local/cuda-13/targets/${ARCH}-linux/lib/stubs/libnvidia-ml.so /usr/lib/${ARCH}-linux-gnu/libnvidia-ml.so.1; \
-    fi
-
-RUN git clone -b core_v0.12.1 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
-    pip install --no-deps -e ./megatron-lm
-
-RUN pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath pyvers
-RUN pip install cloudpickle
-RUN pip install triton==3.6.0
-RUN pip install nvidia-cutlass-dsl==4.3.0
-
-RUN pip install --no-cache-dir setuptools-git-versioning scikit-build && \
-    git clone --recursive -b v1.5.0 https://github.com/pytorch/FBGEMM.git fbgemm && \
-    cd fbgemm/fbgemm_gpu && \
-    python setup.py install --build-target=default --build-variant=cuda -DTORCH_CUDA_ARCH_LIST="7.5 8.0 9.0"
-
-RUN pip install --no-deps tensordict orjson && \
-    git clone --recursive -b release/V1.5.0 https://github.com/pytorch/torchrec.git torchrec && \
-    cd torchrec && \
-    pip install --no-deps .
-
-
-# for dev
-RUN apt update -y --fix-missing && \
-    apt install -y gdb && \
-    apt autoremove -y && \
-    apt clean && \
-    rm -rf /var/lib/apt/lists/*
-
-RUN pip install --no-cache pre-commit
-
-RUN if [ "${TARGETPLATFORM}" = "linux/arm64" ]; then \
+    fi && \
+    if [ "${TARGETPLATFORM}" = "linux/arm64" ]; then \
       CUDA_TARGET_ARCH=sbsa; \
     elif [ "${TARGETPLATFORM}" = "linux/amd64" ]; then \
       CUDA_TARGET_ARCH=x86_64; \
     else \
       CUDA_TARGET_ARCH=$(uname -m); \
     fi && \
     ln -sf /usr/local/cuda-13/targets/${CUDA_TARGET_ARCH}-linux/include/cccl/cuda \
-       /usr/local/cuda/include/cuda
+       /usr/local/cuda/include/cuda && \
+    apt update -y --fix-missing && \
+    apt install -y gdb && \
+    apt autoremove -y && apt clean && rm -rf /var/lib/apt/lists/*
+
+# -- Layer 2: pip dependencies + Megatron-LM ---
+RUN git clone -b core_v0.12.1 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
+    pip install --no-deps -e ./megatron-lm && \
+    pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath pyvers \
+        cloudpickle triton==3.6.0 nvidia-cutlass-dsl==4.3.0 --no-cache pre-commit
+
+# -- Layer 3: FBGEMM (long build, own layer for caching) ---
+RUN pip install --no-cache-dir setuptools-git-versioning scikit-build && \
+    git clone --recursive -b v1.5.0 https://github.com/pytorch/FBGEMM.git fbgemm && \
+    cd fbgemm/fbgemm_gpu && \
+    python setup.py install --build-target=default --build-variant=cuda -DTORCH_CUDA_ARCH_LIST="7.5 8.0 9.0"
+
+# -- Layer 4: TorchRec ---
+RUN pip install --no-deps tensordict orjson && \
+    git clone --recursive -b release/V1.5.0 https://github.com/pytorch/torchrec.git torchrec && \
+    cd torchrec && \
+    pip install --no-deps .

 # Install fbgemm_gpu_hstu (package: fbgemm_gpu_hstu, import: hstu) from submodule
 COPY third_party/FBGEMM /workspace/deps/fbgemm_hstu
@@ -84,23 +73,17 @@ WORKDIR /workspace/recsys-examples
 COPY . .

 RUN cd /workspace/recsys-examples/corelib/dynamicemb && \
-    python setup.py install
-
-RUN cd /workspace/deps && rm -rf nvcomp && \
+    python setup.py install && \
+    cd /workspace/deps && rm -rf nvcomp && \
     wget https://developer.download.nvidia.com/compute/nvcomp/redist/nvcomp/linux-x86_64/nvcomp-linux-x86_64-5.1.0.21_cuda12-archive.tar.xz && \
     tar -xf nvcomp-linux-x86_64-5.1.0.21_cuda12-archive.tar.xz && \
     mv nvcomp-linux-x86_64-5.1.0.21_cuda12-archive nvcomp && \
-    rm nvcomp-linux-x86_64-5.1.0.21_cuda12-archive.tar.xz
-
-RUN cd /workspace/recsys-examples/examples/commons && \
-    TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0" python3 setup.py install
-
-RUN if [ "${TRITONSERVER_BUILD}" != "1" ]; then \
+    rm nvcomp-linux-x86_64-5.1.0.21_cuda12-archive.tar.xz && \
+    cd /workspace/recsys-examples/examples/commons && \
+    TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0" python3 setup.py install && \
+    if [ "${TRITONSERVER_BUILD}" != "1" ]; then \
     rm -f /usr/lib/$(uname -m)-linux-gnu/libcuda.so.1 && \
-    ln -s /usr/local/cuda-13.1/compat/lib.real/libcuda.so.1 /usr/lib/$(uname -m)-linux-gnu/libcuda.so.1; \
-    fi
-
-RUN if [ "${TRITONSERVER_BUILD}" != "1" ]; then \
+    ln -s /usr/local/cuda-13.1/compat/lib.real/libcuda.so.1 /usr/lib/$(uname -m)-linux-gnu/libcuda.so.1 && \
     cd /workspace/recsys-examples/corelib/dynamicemb && \
     mkdir -p torch_binding_build && cd torch_binding_build && \
     cmake .. && make -j; \
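Since every RUN, COPY, and ADD instruction adds one image layer, the layer savings above come purely from merging RUN instructions. A rough way to audit this is to count layer-creating instructions per stage, as in this hypothetical helper (not part of the repo):

```python
# Hypothetical helper: count layer-creating instructions (RUN, COPY, ADD) in a
# Dockerfile snippet, to estimate how close an image is to overlay2's
# 128-layer depth limit. Backslash-continued lines belong to the same
# instruction and do not add layers.
def count_layer_instructions(dockerfile_text: str) -> int:
    layer_ops = ("RUN", "COPY", "ADD")
    count = 0
    continuation = False
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if continuation:
            continuation = stripped.endswith("\\")
            continue
        if stripped.startswith(layer_ops):
            count += 1
            continuation = stripped.endswith("\\")
    return count

# Illustrative before/after, echoing the merge done in this commit:
before = """RUN apt-get update
RUN pip install torchx
RUN pip install cloudpickle
RUN pip install triton==3.6.0"""
after = """RUN apt-get update && \\
    pip install torchx cloudpickle triton==3.6.0"""

print(count_layer_instructions(before), count_layer_instructions(after))  # 4 1
```

The trade-off noted in the commit message applies: merging everything into one RUN would minimize layers but destroy build-cache granularity, which is why the long FBGEMM and TorchRec builds stay in their own layers.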

examples/hstu/modules/exportable_embedding.py

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -56,7 +56,22 @@ def _load_inference_emb_ops() -> bool:


 # Load operators before register fake ops.
-_load_inference_emb_ops()
+# isort: off
+_load_inference_emb_ops()  # registers torch.ops.INFERENCE_EMB.* before import dynamicemb
+import dynamicemb.index_range_meta as _index_range_meta  # noqa: F401 – registers fake impls for torch.export
+import dynamicemb.lookup_meta as _lookup_meta  # noqa: F401 – registers fake impls for torch.export
+
+import hstu_cuda_ops  # noqa: F401 – registers torch.ops.hstu_cuda_ops.*
+import commons.ops.cuda_ops.fake_hstu_cuda_ops  # noqa: F401 – registers fake impls for torch.export
+
+# isort: on
+
+
+# ---------------------------------------------------------------------------
+# ExportableEmbedding Module
+# ---------------------------------------------------------------------------
+
 from configs import InferenceEmbeddingConfig
 from dynamicemb import (
     DynamicEmbInitializerArgs,
@@ -66,10 +81,6 @@ def _load_inference_emb_ops() -> bool:
 from dynamicemb.exportable_tables import InferenceEmbeddingTable
 from torchrec.sparse.jagged_tensor import JaggedTensor, KeyedJaggedTensor

-# ---------------------------------------------------------------------------
-# ExportableEmbedding Module
-# ---------------------------------------------------------------------------
-

 class ExportableEmbedding(torch.nn.Module):
     """

examples/hstu/ops/fused_hstu_op.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
 from typing import Optional, Tuple, Union

 import hstu  # noqa: F401 – registers torch.ops.fbgemm.*
+import hstu.hstu_ops_gpu  # noqa: F401 – registers fake impls for torch.export
 import nvtx
 import torch
 from commons.utils.clear_tensor_data import clear_tensor_data

third_party/FBGEMM

Submodule FBGEMM updated 212 files
