nanoT5 initializes lm_head weights with 768x too much variance, probably #25

@Birch-san

It looks like your lm_head weight init is the same as HF's, which has a known problem: it does not match the original Mesh TensorFlow weight init:
huggingface/transformers#26441

I believe that when training T5 with an untied lm_head, you want to initialize the lm_head weights with std=hidden_dim**-0.5, i.e. about std=0.036 for hidden_dim=768.

Currently the lm_head is initialized with std=1, which is sqrt(768) ≈ 27.7x too much std, or 768x too much variance:
https://github.com/PiotrNawrot/nanoT5/blob/1c82d67bf8dea635be68a3b2a68a43b68b665193/nanoT5/utils/t5_model.py#L505C26-L505C26
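
For concreteness, here's a minimal sketch of the proposed fix. The function name `init_lm_head` and the `factor` argument are illustrative (`factor` mirrors the `initializer_factor` scaling in HF's T5 `_init_weights`; `d_model` mirrors the HF T5 config name), so treat this as a sketch of the intended init, not a drop-in patch:

```python
import torch.nn as nn

def init_lm_head(lm_head: nn.Linear, d_model: int, factor: float = 1.0) -> None:
    """Initialize an untied T5 lm_head to match Mesh TensorFlow.

    Mesh TF draws the output projection from N(0, d_model ** -0.5),
    whereas the current code uses std = 1.0. For d_model = 768:
    768 ** -0.5 ≈ 0.036, so std = 1.0 is sqrt(768) ≈ 27.7x too large
    in std, i.e. 768x too large in variance.
    """
    lm_head.weight.data.normal_(mean=0.0, std=factor * d_model ** -0.5)
```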
