Why are PRs, such as the current record #1855, doing this?
Looking at prior documents does not break causality. Any LLM that doesn't use intra-document masking is already looking at prior documents through attention. There is no 'cheating' involved here, unless the maintainers have made an arbitrary ruling on this. When the one-position techniques like smear gate and bigram hash were created, both the masked and unmasked versions were tested, and the unmasked version was intentionally selected because it ran faster, didn't hurt loss, and obeyed the causal mask.
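To make the masked/unmasked distinction concrete, here is a minimal sketch in plain Python. It is not the code from any record: `bigram_bucket`, `bigram_ids`, `PAD`, `NUM_BUCKETS`, the toy hash, and the `doc_ids` layout are all hypothetical, chosen only to illustrate a one-position lookback. The point is that both variants depend only on tokens at or before position `t`, and they differ only at the first token of each document.

```python
# Illustrative sketch only; all names and constants here are assumptions.
PAD = 0             # assumed sentinel used when there is no usable previous token
NUM_BUCKETS = 1024  # assumed hash table size


def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Toy hash of a (previous, current) token pair into a bucket index."""
    return (prev_tok * 1000003 + cur_tok) % NUM_BUCKETS


def bigram_ids(tokens: list[int], doc_ids: list[int], mask_across_docs: bool) -> list[int]:
    """Bucket index per position, looking back exactly one token.

    With mask_across_docs=False the lookback may land on the last token of
    the previous document. With mask_across_docs=True that token is replaced
    by PAD. Neither variant reads any position after t.
    """
    out = []
    for t, cur in enumerate(tokens):
        if t == 0:
            prev = PAD
        elif mask_across_docs and doc_ids[t] != doc_ids[t - 1]:
            prev = PAD  # first token of a new document: hide the prior document
        else:
            prev = tokens[t - 1]
        out.append(bigram_bucket(prev, cur))
    return out


# Two documents packed into one sequence; the boundary is at position 3.
tokens = [5, 6, 7, 8, 9]
doc_ids = [0, 0, 0, 1, 1]
unmasked = bigram_ids(tokens, doc_ids, mask_across_docs=False)
masked = bigram_ids(tokens, doc_ids, mask_across_docs=True)

# Changing a future token leaves every earlier output untouched in both variants.
future_changed = tokens[:4] + [99]
assert bigram_ids(future_changed, doc_ids, False)[:4] == unmasked[:4]
assert bigram_ids(future_changed, doc_ids, True)[:4] == masked[:4]
```

In this toy setup the masked version needs an extra per-position boundary check and document-id input, which is the kind of overhead the unmasked version avoids.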
I am concerned that every record is going to copy-paste this, and the final record is going to carry this janky inefficiency for no good reason.