
Support layernorm recompute for fused hstu layer#59

Merged
shijieliu merged 4 commits into NVIDIA:main from JacoCheung:junzhang/recompute_layernorm
Jun 9, 2025

Conversation

@JacoCheung
Collaborator

Description

This PR addresses #6. Currently only the input layer norm is recomputed, which incurs a slight (~1%) performance drop in the backward pass on an A100-PCIe-80G, measured with dim_per_heads=128, num_heads=4, seqlen=4096, batchsize=32, embedding_dim=512.
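The memory/compute trade-off described above can be sketched in plain Python. This is only an illustrative sketch, not the fused kernel this PR modifies: the `RecomputeLayerNorm` class below is hypothetical. The idea is that the forward pass saves only the raw input, and the backward pass re-derives the normalized activations instead of keeping them resident, spending a little extra compute to free activation memory.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a 1-D list to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]

class RecomputeLayerNorm:
    """Hypothetical sketch of the recompute pattern: store only the raw
    input in forward; recompute the normalized activations in backward
    rather than caching them, trading compute for activation memory."""

    def forward(self, x):
        self.saved_input = x      # only the input is kept resident
        return layer_norm(x)      # the normalized output is NOT saved

    def backward(self):
        # Recompute the normalized activations the gradient math needs.
        return layer_norm(self.saved_input)

ln = RecomputeLayerNorm()
y = ln.forward([1.0, 2.0, 3.0, 4.0])
y_recomputed = ln.backward()
assert y == y_recomputed  # recomputation reproduces the forward result
```

In the actual fused HSTU layer the recomputation happens inside the backward kernel, which is where the reported ~1% backward slowdown comes from.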

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JacoCheung
Collaborator Author

Please track the CI status.

@JacoCheung JacoCheung requested a review from shijieliu June 5, 2025 01:51
Comment thread examples/commons/utils/clear_tensor_data.py
Comment thread examples/hstu/configs/hstu_config.py
Comment thread examples/hstu/ops/fused_hstu_op.py
@JacoCheung
Collaborator Author

Updated CI

@shijieliu shijieliu merged commit f3b6798 into NVIDIA:main Jun 9, 2025
@shijieliu shijieliu mentioned this pull request Jun 12, 2025
@JacoCheung JacoCheung deleted the junzhang/recompute_layernorm branch February 2, 2026 03:37