Fix ESM2 + Geneformer regression #863

Open · wants to merge 3 commits into main

Conversation

dorotat-nv (Collaborator)

Description

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebook validation tests are skipped unless they are explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

TODO: Add code snippet
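Since the snippet is still a TODO, the following is a minimal, illustrative sketch (not part of this PR) of one way to exercise the affected code locally by running the sub-package test suites. The `sub-packages/bionemo-esm2/tests` and `sub-packages/bionemo-geneformer/tests` paths are assumptions based on the repository layout; only the `sub-packages/bionemo-evo2/tests` path appears verbatim in the CI report below.

```python
# Illustrative sketch only -- the PR author has not yet provided a usage snippet.
# Runs the ESM2 and Geneformer test suites through pytest's programmatic entry point.
# The sub-package paths are assumptions; adjust them to the actual repository layout.
import sys

import pytest

if __name__ == "__main__":
    exit_code = pytest.main(
        [
            "sub-packages/bionemo-esm2/tests",        # assumed ESM2 test location
            "sub-packages/bionemo-geneformer/tests",  # assumed Geneformer test location
            "-x",                                     # stop at the first failure
        ]
    )
    sys.exit(exit_code)
```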

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

codecov-commenter commented May 8, 2025

❌ 10 Tests Failed:

Tests completed | Failed | Passed | Skipped
1002            | 10     | 992    | 22
View the top 3 failed test(s) by shortest run time
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_hyena_operators.py::TestParallelHyenaOperator::test_initialization
Stack Traces | 0.063s run time
self = <test_hyena_operators.TestParallelHyenaOperator object at 0x7f12b3f08440>
transformer_config = TransformerConfig(tensor_model_parallel_size=1, pipeline_model_parallel_comm_backend=None, pipeline_model_parallel_siz...groups=8, mamba_num_heads=None, use_mamba_mem_eff_path=True, mlp_chunks_for_prefill=1, heterogeneous_block_specs=False)
hyena_config = HyenaConfig(tie_projection_weights=False, to_upper='normalized_weighted', lowercase_loss_reweighting=0.1, short_conv_L...onv_mixer=False, hyena_short_conv_pregate=True, hyena_short_conv_postgate=True, proj_groups=1, grouped_attention=False)

    @pytest.fixture
    def operator(self, transformer_config: TransformerConfig, hyena_config: HyenaConfig) -> ParallelHyenaOperator:
        with megatron_parallel_state_utils.distributed_model_parallel_state():
>           yield ParallelHyenaOperator(
                hidden_size=transformer_config.hidden_size,
                transformer_config=transformer_config,
                hyena_config=hyena_config,
                max_sequence_length=1024,
                operator_type="hyena_medium_conv",
                init_method="small_init",
            )

.../bionemo/evo2/test_hyena_operators.py:47: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:797: in __init__
    self.conv_bias.data = conv_init_method(self.conv_bias.data)
.../local/lib/python3.12.../torch/nn/init.py:166: in uniform_
    return _no_grad_uniform_(tensor, a, b, generator)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tensor = tensor([-0.0000e+00, -7.8740e-05, -1.5748e-04, -2.3622e-04, -3.1496e-04,
        -3.9370e-04, -4.7244e-04, -5.5118e-04...3, -7.4713e-03, -7.5552e-03, -7.6392e-03,
        -7.7231e-03, -7.8071e-03, -7.8910e-03, -7.9750e-03], device='cuda:0')
a = -0.08838834764831845, b = 0.08838834764831845, generator = None

    def _no_grad_uniform_(tensor, a, b, generator=None):
        with torch.no_grad():
>           return tensor.uniform_(a, b, generator=generator)
E           RuntimeError: Offset increment outside graph capture encountered unexpectedly.

.../local/lib/python3.12.../torch/nn/init.py:17: RuntimeError
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_hyena_operators.py::TestParallelCausalDepthwiseConv1d::test_gpu_forward
Stack Traces | 0.064s run time
self = <test_hyena_operators.TestParallelCausalDepthwiseConv1d object at 0x7f12b3f0a870>
transformer_config = TransformerConfig(tensor_model_parallel_size=1, pipeline_model_parallel_comm_backend=None, pipeline_model_parallel_siz...groups=8, mamba_num_heads=None, use_mamba_mem_eff_path=True, mlp_chunks_for_prefill=1, heterogeneous_block_specs=False)
hyena_config = HyenaConfig(tie_projection_weights=False, to_upper='normalized_weighted', lowercase_loss_reweighting=0.1, short_conv_L...onv_mixer=False, hyena_short_conv_pregate=True, hyena_short_conv_postgate=True, proj_groups=1, grouped_attention=False)

    @pytest.fixture
    def operator(
        self, transformer_config: TransformerConfig, hyena_config: HyenaConfig
    ) -> ParallelCausalDepthwiseConv1d:
        with megatron_parallel_state_utils.distributed_model_parallel_state():
>           yield ParallelCausalDepthwiseConv1d(
                d_model=transformer_config.hidden_size,
                transformer_config=transformer_config,
                hyena_config=hyena_config,
                kernel_size=hyena_config.short_conv_L,
                init_method=transformer_config.init_method,
                bias=hyena_config.conv_proj_bias,
                use_fast_causal_conv=hyena_config.fast_conv_proj,
            )

.../bionemo/evo2/test_hyena_operators.py:167: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:1070: in __init__
    initialize_affine_weight_gpu(self.short_conv_weight, conv_init_method, partition_dim=0)
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:688: in initialize_affine_weight_gpu
    init_method(weight.data)  # modify the data in place
.../local/lib/python3.12.../torch/nn/init.py:166: in uniform_
    return _no_grad_uniform_(tensor, a, b, generator)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tensor = tensor([[-0.0000, -0.0002, -0.0003],
        [-0.0005, -0.0006, -0.0008],
        [-0.0009, -0.0011, -0.0012],
       ...-0.0043, -0.0045, -0.0047],
        [-0.0049, -0.0051, -0.0053],
        [-0.0055, -0.0057, -0.0059]], device='cuda:0')
a = -0.5773502691896257, b = 0.5773502691896257, generator = None

    def _no_grad_uniform_(tensor, a, b, generator=None):
        with torch.no_grad():
>           return tensor.uniform_(a, b, generator=generator)
E           RuntimeError: Offset increment outside graph capture encountered unexpectedly.

.../local/lib/python3.12.../torch/nn/init.py:17: RuntimeError
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_hyena_operators.py::TestParallelShortHyenaOperatorWithConvBias::test_gpu_forward
Stack Traces | 0.065s run time
self = <test_hyena_operators.TestParallelShortHyenaOperatorWithConvBias object at 0x7f12b3f08aa0>
transformer_config = TransformerConfig(tensor_model_parallel_size=1, pipeline_model_parallel_comm_backend=None, pipeline_model_parallel_siz...groups=8, mamba_num_heads=None, use_mamba_mem_eff_path=True, mlp_chunks_for_prefill=1, heterogeneous_block_specs=False)
hyena_config = HyenaConfig(tie_projection_weights=False, to_upper='normalized_weighted', lowercase_loss_reweighting=0.1, short_conv_L...onv_mixer=False, hyena_short_conv_pregate=True, hyena_short_conv_postgate=True, proj_groups=1, grouped_attention=False)

    @pytest.fixture
    def operator(self, transformer_config: TransformerConfig, hyena_config: HyenaConfig) -> ParallelShortHyenaOperator:
        with megatron_parallel_state_utils.distributed_model_parallel_state():
>           yield ParallelShortHyenaOperator(
                hidden_size=transformer_config.hidden_size,
                transformer_config=transformer_config,
                hyena_config=hyena_config,
                init_method="small_init",
                short_conv_class=ParallelCausalDepthwiseConv1d,
                use_fast_causal_conv=False,
                local_init=False,
                use_conv_bias=True,
            )

.../bionemo/evo2/test_hyena_operators.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:934: in __init__
    self.short_conv = short_conv_class(
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:1070: in __init__
    initialize_affine_weight_gpu(self.short_conv_weight, conv_init_method, partition_dim=0)
.../local/lib/python3.12.../megatron/hyena/hyena_utils.py:688: in initialize_affine_weight_gpu
    init_method(weight.data)  # modify the data in place
.../local/lib/python3.12.../torch/nn/init.py:166: in uniform_
    return _no_grad_uniform_(tensor, a, b, generator)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tensor = tensor([[[-0.0000, -0.0004, -0.0009,  ..., -0.0017, -0.0021, -0.0026]],

        [[-0.0030, -0.0034, -0.0038,  ..., -0..., -0.0161, -0.0168]],

        [[-0.0176, -0.0183, -0.0190,  ..., -0.0204, -0.0211, -0.0218]]],
       device='cuda:0')
a = -0.5773502691896257, b = 0.5773502691896257, generator = None

    def _no_grad_uniform_(tensor, a, b, generator=None):
        with torch.no_grad():
>           return tensor.uniform_(a, b, generator=generator)
E           RuntimeError: Offset increment outside graph capture encountered unexpectedly.

.../local/lib/python3.12.../torch/nn/init.py:17: RuntimeError

To view more test analytics, go to the Test Analytics Dashboard
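All ten failures raise the same error from an in-place `tensor.uniform_` call during weight initialization: `RuntimeError: Offset increment outside graph capture encountered unexpectedly`. This check lives in PyTorch's CUDA RNG and suggests the generator's graph-capture bookkeeping got out of sync with the current stream, for example when an earlier test left capture state behind. The sketch below is illustrative only and is not the fix in this PR; it shows one defensive pattern, checking `torch.cuda.is_current_stream_capturing()` before GPU-side random initialization, and the helper name is hypothetical.

```python
# Illustrative sketch only -- not the fix applied in this PR.
# The repeated failure is an in-place RNG call (tensor.uniform_) that trips
# PyTorch's CUDA graph-capture consistency check. One defensive pattern is to
# confirm that no stream capture is in progress before GPU-side random init,
# turning a confusing downstream RuntimeError into an explicit message.
import torch


def init_uniform_outside_capture(tensor: torch.Tensor, a: float, b: float) -> torch.Tensor:
    """Hypothetical helper: in-place uniform init that refuses to run during CUDA graph capture."""
    if tensor.is_cuda and torch.cuda.is_current_stream_capturing():
        raise RuntimeError(
            "Random initialization requested during an active CUDA graph capture; "
            "initialize weights before starting (or after ending) the capture."
        )
    with torch.no_grad():
        return tensor.uniform_(a, b)


if __name__ == "__main__":
    if torch.cuda.is_available():
        # Bounds taken from the failing trace above (a = -0.577..., b = 0.577...).
        weight = torch.empty(16, 3, device="cuda")
        init_uniform_outside_capture(weight, -0.5773502691896257, 0.5773502691896257)
```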
