@saahiluppal Thank you! Yes, please help us review and discuss in the RFC.
The padding mask, causal mask, and segment mask can ultimately be composed into a single attention mask tensor of shape (batch size, ..., target sequence, source sequence). If we keep the mask composition logic outside the layer and just pass the attention mask in, it should be flexible enough to cover these use cases. Please correct me if I am wrong. (Yes, we can move the discussion to the RFC. Thanks!)
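A minimal sketch of what that composition could look like with `tf.keras.layers.MultiHeadAttention`; the shapes, the padding convention (token id 0 = padding), and the segment ids below are illustrative assumptions, not part of the proposal:

```python
import tensorflow as tf

# Illustrative shapes: B = batch, T = target length, S = source length.
B, T, S, D = 2, 6, 6, 16
query = tf.random.normal([B, T, D])
value = tf.random.normal([B, S, D])

# Assumed inputs: token id 0 marks padding; each key position has a segment id.
source_ids = tf.constant([[7, 3, 9, 5, 0, 0],
                          [4, 8, 0, 0, 0, 0]])
segment_q = tf.zeros([B, T], tf.int32)            # all queries in segment 0
segment_k = tf.constant([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 1, 1]])

# 1) Padding mask, (B, 1, S): padded key positions cannot be attended to.
padding_mask = tf.not_equal(source_ids, 0)[:, tf.newaxis, :]

# 2) Causal mask, (1, T, S): position t may only attend to positions <= t.
causal_mask = tf.sequence_mask(tf.range(1, T + 1), maxlen=S)[tf.newaxis, :, :]

# 3) Segment mask, (B, T, S): attend only within the same segment.
segment_mask = tf.equal(segment_q[:, :, tf.newaxis], segment_k[:, tf.newaxis, :])

# Compose outside the layer into one (B, T, S) boolean attention mask ...
attention_mask = tf.logical_and(tf.logical_and(padding_mask, causal_mask),
                                segment_mask)

# ... and pass it straight into the layer.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D)
output = mha(query, value, attention_mask=attention_mask)
print(output.shape)  # (2, 6, 16)
```

Any subset of the three masks can be combined the same way, which is why composing them outside the layer keeps the layer itself simple.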
MultiHeadAttention should really handle two types of attention masks, while the current implementation seems to have only one, i.e. the mask over the padding positions (the 1st option).