It looks like your lm_head weight init is the same as HF's, which has a known problem: it doesn't match the original Mesh TensorFlow weight init.
huggingface/transformers#26441
I believe that when training T5 with an untied lm_head, you would want to initialize the lm_head weights with std = hidden_dim ** -0.5, i.e. about std = 0.036 for hidden_dim = 768.
Currently the lm_head is inited with std = 1, which is ~27.7x too much std, or 768x too much variance.
https://github.com/PiotrNawrot/nanoT5/blob/1c82d67bf8dea635be68a3b2a68a43b68b665193/nanoT5/utils/t5_model.py#L505C26-L505C26
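For concreteness, a minimal sketch of the proposed fix (assuming lm_head is a plain `nn.Linear`; the dims here are T5-base-style illustrative values, not taken from your code):

```python
import torch.nn as nn

hidden_dim = 768     # illustrative: T5-base hidden size
vocab_size = 32128   # illustrative vocab size

lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

# Mesh TensorFlow init for an untied lm_head:
# std = hidden_dim ** -0.5 ≈ 0.036 for hidden_dim = 768.
# std = 1 would be 768 ** 0.5 ≈ 27.7x too large (768x in variance).
nn.init.normal_(lm_head.weight, mean=0.0, std=hidden_dim ** -0.5)
```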