From 1b0af45656a30585f23d522e0822cbb7fab8855e Mon Sep 17 00:00:00 2001
From: Hongkun Yu
Date: Tue, 16 Jun 2020 18:37:56 -0700
Subject: [PATCH 01/10] Create 20200616-keras-multihead-attention.md

The feedback phase will be open for two weeks until Wednesday July 02, 2020.

# RFC: Multihead Attention and EinsumDense on Keras

| Status        | (Proposed / Accepted / Implemented / Obsolete)           |
| :------------ | :------------------------------------------------------ |
| **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) |
: : (update when you have community PR #) :
| **Author(s)** | Hongkun Yu (hongkuny@google.com) |
| **Sponsor**   | Francois Chollet (fchollet@google.com)                   |
| **Updated**   | 2020-06-16                                               |

## Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras.
---
 rfcs/20200616-keras-multihead-attention.md | 341 +++++++++++++++++++++
 1 file changed, 341 insertions(+)
 create mode 100644 rfcs/20200616-keras-multihead-attention.md

diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md
new file mode 100644
index 000000000..e5a2b0584
--- /dev/null
+++ b/rfcs/20200616-keras-multihead-attention.md
@@ -0,0 +1,341 @@
+# RFC: Multihead Attention and EinsumDense on Keras
+
+| Status        | (Proposed / Accepted / Implemented / Obsolete)           |
+| :------------ | :------------------------------------------------------ |
+| **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) |
+: : (update when you have community PR #) :
+| **Author(s)** | Hongkun Yu (hongkuny@google.com) |
+| **Sponsor**   | Francois Chollet (fchollet@google.com)                   |
+| **Updated**   | 2020-06-16                                               |
+
+## Objective
+
+Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras.
+
+## Motivation
+
+MultiHeadAttention is very popular and has become a standard building block in
+deep learning libraries. We propose to contribute a flexible, well-defined
+implementation inside Keras that absorbs common best practices from reference
+libraries.
+
+## User Benefit
+
+We can standardize the implementation of Transformer layers and follow best
+practices. We offer a rich set of functionalities for different use cases, e.g.
+different projection spaces, outputting multi-head attention scores for
+analysis, etc. We also modularize computations to make the MultiHeadAttention
+layer extensible to variants.
+
+## Design Proposal
+
+### Key Features
+
+*   Returns multi-headed attention scores, which is commonly useful for
+    attention visualization and analysis.
+*   Supports query (Q), key (K), value (V) tensors as individual inputs and
+    supports projecting Q, K, V to different dimensions.
+*   Projects the final outputs to user-specified dimensions.
+*   Uses tf.einsum to express high-dimensional computations and adopts the
+    [tf.keras.layers.experimental.EinsumDense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/EinsumDense)
+    layer.
+*   Supports high-dimensional attention when the target and source are 2D, 3D,
+    etc.
+
+### Code Examples
+
+*   How to write a TransformerBlock for an encoder.
+
+```python
+class TransformerBlock(tf.keras.layers.Layer):
+  def __init__(self, embed_dim, num_heads, ff_dim):
+    super(TransformerBlock, self).__init__()
+    self.att = MultiHeadAttention(num_heads=num_heads, key_size=embed_dim)
+    self.ffn = tf.keras.Sequential(
+        [tf.keras.layers.Dense(ff_dim, activation="relu"),
+         tf.keras.layers.Dense(embed_dim),]
+    )
+    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
+    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
+
+  def call(self, inputs, attention_mask=None):
+    attn_output = self.att([inputs, inputs], attention_mask=attention_mask)
+    out1 = self.layernorm1(inputs + attn_output)
+    ffn_output = self.ffn(out1)
+    return self.layernorm2(out1 + ffn_output)
+```
+
+*   Use an attention mask to avoid performing attention on padding token
+    indices.
+
+```python
+test_layer = TransformerBlock(
+    embed_dim=2,
+    num_heads=2,
+    ff_dim=4)
+query = np.array([[[0.1, 0.2], [0.0, 0.0]]])
+mask = np.array([[[1, 0], [1, 0]]], dtype='bool')
+output = test_layer(query, attention_mask=mask)
+```
+
+*   Inside a Transformer decoder, we often want to output the cross-attention
+    scores to analyze how the target sequence attends to the source sequence.
+    We can then visualize the alignment according to the attention scores.
+
+```python
+test_layer = MultiHeadAttention(
+    num_heads=2, key_size=2, return_attention_scores=True)
+target = np.array([[[0.1, 0.2], [0.0, 0.0]]])
+source = np.array([[[0.1, 0.2], [3.0, 1.0]]])
+output, scores = test_layer([target, source])
+scores = tf.math.reduce_sum(scores, axis=1)  # shape = (1, 2, 2)
+```
+
+*   Attention beyond sequences, taking 2D or 3D targets and sources.
+
+```python
+query_shape = [2, 3, 4, 4]  # batch, target, target, embedding.
+value_shape = [2, 3, 2, 4]  # batch, source, source, embedding.
+mask_shape = [2, 3, 4, 3, 2]
+query = 10 * np.random.random_sample(query_shape)
+value = 10 * np.random.random_sample(value_shape)
+mask_data = np.random.randint(2, size=mask_shape).astype("bool")
+output = test_layer([query, value], mask_data)
+```
+
+### Interface
+
+```python
+class MultiHeadAttention(tf.keras.layers.Layer):
+  """MultiHeadAttention layer.
+
+  This is an implementation of multi-headed attention based on "Attention
+  Is All You Need". If `query`, `key`, `value` are the same, then
+  this is self-attention. Each timestep in `query` attends to the
+  corresponding sequence in `key`, and returns a fixed-width vector.
+
+  This layer first projects `query`, `key` and `value`. These are
+  (effectively) a list of tensors of length `num_attention_heads`, where the
+  corresponding shapes are [batch_size, <query dimensions>, key_size],
+  [batch_size, <key dimensions>, key_size],
+  [batch_size, <value dimensions>, value_size].
+
+  Then, the query and key tensors are dot-producted and scaled. These are
+  softmaxed to obtain attention probabilities. The value tensors are then
+  interpolated by these probabilities, then concatenated back to a single
+  tensor.
+
+  Finally, the result tensor with the last dimension as value_size can take a
+  linear projection and be returned.
+
+  Examples:
+
+  Performs 1D cross-attention over two sequence inputs with an attention mask.
+  Returns the additional attention weights over heads.
+
+  >>> layer = MultiHeadAttention(num_heads=2, key_size=2,
+  ...                            return_attention_scores=True)
+  >>> target = tf.keras.Input(shape=[8, 16])
+  >>> source = tf.keras.Input(shape=[4, 16])
+  >>> mask_tensor = tf.keras.Input(shape=[8, 4])
+  >>> output_tensor, weights = layer([target, source])
+  >>> print(output_tensor.shape), print(weights.shape)
+  (None, 8, 16) (None, 2, 8, 4)
+
+  Performs 2D self-attention over a 5D input tensor on axes 2 and 3.
+
+  >>> layer = MultiHeadAttention(num_heads=2, key_size=2, attention_axes=(2, 3))
+  >>> input_tensor = tf.keras.Input(shape=[5, 3, 4, 16])
+  >>> output_tensor = layer([input_tensor, input_tensor])
+  >>> print(output_tensor.shape)
+  (None, 5, 3, 4, 16)
+
+  Arguments:
+    num_heads: Number of attention heads.
+    key_size: Size of each attention head for query and key.
+    value_size: Size of each attention head for value.
+    dropout: Dropout probability for a Dropout layer on attention_scores.
+    use_bias: Boolean, whether the dense layers use bias vectors/matrices.
+    output_shape: The expected shape of an output tensor, besides the batch and
+      sequence dims. If not specified, projects back to the key feature dim.
+    attention_axes: axes over which the attention is applied. `None` means
+      attention over all axes except batch, heads, and features.
+    return_attention_scores: bool, if `True`, returns the multi-head attention
+      scores as an additional output argument.
+    kernel_initializer: Initializer for dense layer kernels.
+    bias_initializer: Initializer for dense layer biases.
+    kernel_regularizer: Regularizer for dense layer kernels.
+    bias_regularizer: Regularizer for dense layer biases.
+    activity_regularizer: Regularizer for dense layer activity.
+    kernel_constraint: Constraint for dense layer kernels.
+    bias_constraint: Constraint for dense layer biases.
+  """
+
+  def call(self, inputs, attention_mask=None):
+    """Implements the forward pass.
+
+    Size glossary:
+      * Number of heads (H): the number of attention heads.
+      * Value size (V): the size of each value embedding per head.
+      * Key size (K): the size of each key embedding per head. Equally, the size
+        of each query embedding per head. Typically K <= V.
+      * Batch dimensions (B).
+      * Query (target) attention axes shape (T).
+      * Value (source) attention axes shape (S), the rank must match the target.
+
+    Args:
+      inputs: List of the following tensors:
+        * query: Query `Tensor` of shape `[B, T, dim]`.
+        * value: Value `Tensor` of shape `[B, S, dim]`.
+        * key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will
+          use `value` for both `key` and `value`, which is the most common case.
+      attention_mask: a boolean mask of shape `[B, T, S]`, that prevents
+        attention to certain positions.
+
+    Returns:
+      attention_output: The result of the computation, of shape [B, T, E],
+        where `T` is for target sequence shapes and `E` is the query input last
+        dimension if `output_shape` is `None`. Otherwise, the multi-head outputs
+        are projected to the shape specified by `output_shape`.
+      attention_scores: [Optional] multi-head attention coefficients over
+        attention axes.
+    """
+```
+
+### Auxiliary Layers and Changes
+
+*   EinsumDense layer
+
+We use `tf.einsum` to implement a dense layer that can perform einsum
+calculations of arbitrary dimensionality. This example shows how to instantiate
+a layer that applies the same dense operation to every element in a sequence.
+Here, the `output_shape` has two values (since there are two non-batch
+dimensions in the output); the first dimension in the output_shape is `None`,
+because the sequence dimension `b` has an unknown shape.
+
+```python
+layer = EinsumDense("abc,cd->abd", output_shape=(None, 64), bias_axes="d")
+input_tensor = tf.keras.Input(shape=[32, 128])
+output_tensor = layer(input_tensor)  # output shape is (None, 32, 64)
+```
+
+*   Masked Softmax
+
+Inside the attention computation, we need to mask logits before the softmax;
+this has become a common treatment in many applications. We propose to add an
+optional `mask` argument to `tf.nn.softmax`. The downstream keras `Softmax`
+layer will also take an optional `mask` tensor. This `mask` tensor should have
+the same rank as the input tensor and masks elements on the axes along which
+the softmax is performed.
+
+Inside the `MultiHeadAttention` keras layer, we will use the keras `Softmax`
+layer with mask and adjust the attention mask shape to match the inputs. The
+dimension expansion logic and multi-axes softmax will be handled locally in the
+`MultiHeadAttention` layer.
+
+*   Keras Dense Attention
+
+[tf.keras.layers.Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention)
+layer call method takes an optional argument, `mask`, which requires two
+tensors, `q_mask` and `v_mask`. They are following keras framework requirements
+with (batch_size, target_length) and (batch_size, source_length) as shapes. This
+limits the flexibility of masking and `MultiHeadAttention` layer generalize the
+attention mask to be (batch dims, target dims, source dims). To be consistent,
+we would like to introduce an optional argument `attention_mask` for
+`tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`,
+the shape is (batch_size, target_length, source_length). Whenever
+`attention_mask` is specified, the `mask` argument is OK to be skipped.
+
+### Alternatives Considered
+
+We examined multi-head attention layer implemented in various libraries. There
+are a few features that we do not include inside this keras layer and we feel it
+is better to subclass the `MultiHeadAttention` layer to fulfill the needs.
+
+*   Attention caching for decoding. Implemented in
+    [Flax](https://github.com/google/flax/blob/master/flax/nn/attention.py#L301).
+    The caching is a special treatment for inference and we noticed that
+    different treatments are required for dynamic or static shape programs.
+    Thus, subclassing as a
+    [CachedAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention.py)
+    layer is the solution inside the model garden.
+*   [MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py)
+    keras layer is also implemented in TF-Addons. The design in this doc covers
+    the features in the TF-Addons implementation but generalizes to more use
+    cases.
+
+### Performance Implications
+
+*   We will add microbenchmarks following the common practices of keras layers.
+*   We have end-to-end integration/regression tests for models using this layer,
+    e.g. BERT.
+
+### Dependencies
+
+No dependencies.
+
+### Engineering Impact
+
+*   The keras layer can be tested inside the package.
+*   TensorFlow team will maintain the code.
+
+### Platforms and Environments
+
+*   Works for all platforms and environments.
+
+### Best Practices
+
+*   No changes to TensorFlow best practices.
+
+### Tutorials and Examples
+
+*   Code examples can be found inside the TensorFlow Model Garden. For example,
+    an encoder
+    [Transformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer.py).
+ +* 2D attention example in the + [unit test](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention_test.py#L135). + +### Compatibility + +* This is a new layer without compatibility concerns. +* The proposal works with TFLite, distribution strategy, tf.function, GPU/TPU + and serializable to SavedModel. These are tested inside TensorFlow Model + Garden applications. + +### User Impact + +* We will first introduce the layer as + `tf.keras.layers.experimental.MultiHeadAttention` and + `tf.keras.layers.experimental.MaskedSoftmax`. + +## Detailed Design + +The layer has been implemented as the +[MultiHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention.py#L116) +inside TensorFlow Model Garden. + +First, as we rely on `tf.eisum` to define projections and attention computation, +we need to figure out the einsum notation of each computation. Furthermore, to +make the layer generalize to high-dimension cases, i.e. there are more than one +batch dimensions and attention softmax can be performed on multiple axes, we +need to track the batch axes and attention axes inside einsum notations. We use +a vector of chars and use two local methods to generate einsum notations for +projections and attentions. + +Second, the layer by default implements the most common dot-product attention. +There are various ways to implement the attention computation, so we modulize it +as two methods `_build_attention` and `_compute_attention`. Thus, users may be +able to just override them to get a new keras layer with a novel attention +method. For example, we implemented +[TalkingHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) +introduced by ["Talking-Heads Attention "](https://arxiv.org/abs/2003.02436) +paper. Using the keras Attention layer as another example, since it supports the +basic single-head case 1-D attention, we can use it inside `_build_attention` +and `_compute_attention`. + +## Questions and Discussion Topics + +- cuDNN has the + [multi-head attention](https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnMultiHeadAttnForward) + function. How do we incorporate it? A: we modularize the attention + computation components in order to support new low-level functions without + changing this layer interface. The cuDNN function supports the classic + dot-product attention with classic input dimensions. We will be able to use + it once TensorFlow add an op to use it. 
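To make the einsum-notation scheme described in the Detailed Design section concrete, the sketch below shows one way the per-head query projection can be expressed with `EinsumDense` for the plain 1-D case. The dimension letters, equation, and shapes are illustrative assumptions; the layer itself generates its einsum notations programmatically and may use different conventions.

```python
import tensorflow as tf

num_heads, key_size = 2, 64

# Illustrative per-head query projection for the 1-D case:
#   a = batch, b = sequence, c = model dim, d = heads, e = per-head key size.
# The kernel has shape [c, d, e], so [batch, seq, dim] -> [batch, seq, heads, size].
query_dense = tf.keras.layers.experimental.EinsumDense(
    "abc,cde->abde",
    output_shape=(None, num_heads, key_size),
    bias_axes="de")

query = tf.keras.Input(shape=[8, 16])
projected_query = query_dense(query)  # shape: (None, 8, 2, 64)

# With a key projected the same way, single-axis attention scores could then be
# formed with an equation such as
#   tf.einsum("abde,acde->adbc", projected_query, projected_key)
# which yields shape [batch, heads, target_length, source_length].
```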
From ec047a9fa4d08b1e57c7ee9b6173c084958d6c7b Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Tue, 16 Jun 2020 18:41:00 -0700 Subject: [PATCH 02/10] Update 20200616-keras-multihead-attention.md --- rfcs/20200616-keras-multihead-attention.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index e5a2b0584..fc3e62942 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -2,8 +2,7 @@ | Status | (Proposed / Accepted / Implemented / Obsolete) | | :------------ | :------------------------------------------------------ | -| **RFC #** | [NNN](https://github.com/tensorflow/community/pull/NNN) | -: : (update when you have community PR #) : +| **RFC #** | [260](https://github.com/tensorflow/community/pull/260) | | **Author(s)** | Hongkun Yu (hongkuny@google.com) | | **Sponsor** | Francois Chollet (fchollet@google.com) | | **Updated** | 2020-06-16 | From 047209568fac66f3fd365cf811d5187fd6f40c6d Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Tue, 16 Jun 2020 18:44:03 -0700 Subject: [PATCH 03/10] Update 20200616-keras-multihead-attention.md `tf.keras.layers.experimental.MultiHeadAttention` and `tf.keras.layers.experimental.EinsumDense` --- rfcs/20200616-keras-multihead-attention.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index fc3e62942..3b3198ba6 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -302,7 +302,7 @@ No dependencies. * We will first introduce the layer as `tf.keras.layers.experimental.MultiHeadAttention` and - `tf.keras.layers.experimental.MaskedSoftmax`. + `tf.keras.layers.experimental.EinsumDense`. ## Detailed Design From 6b1b201eb0328ecd5d6920585a7fb7f50bca7d8a Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Tue, 16 Jun 2020 21:33:13 -0700 Subject: [PATCH 04/10] Update 20200616-keras-multihead-attention.md Add mark to authors. Add plan for addons migration. --- rfcs/20200616-keras-multihead-attention.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index 3b3198ba6..e67b8852e 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -3,7 +3,7 @@ | Status | (Proposed / Accepted / Implemented / Obsolete) | | :------------ | :------------------------------------------------------ | | **RFC #** | [260](https://github.com/tensorflow/community/pull/260) | -| **Author(s)** | Hongkun Yu (hongkuny@google.com) | +| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) | | **Sponsor** | Francois Chollet (fchollet@google.com) | | **Updated** | 2020-06-16 | @@ -242,6 +242,15 @@ we would like to introduce an optional argument `attention_mask` for the shape is (batch_size, target_length, source_length). Whenever `attention_mask` is specified, the `mask` argument is OK to be skipped. +* TFA `MultiHeadAttention` Deprecation and Re-mapping + +[MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py) has been released. 
The proposed `MultiHeadAttention` has similar `__init__` arguments +and `call` interface, where the minor differences are argument names and the attention `mask` shape. +We expect the new `MultiHeadAttention` keras layer will +cover the functionalities. Once the implementation are merged as experimental layers, +we will work with TF Addons team to design the deprecation and re-mapping procedure. + + ### Alternatives Considered We examined multi-head attention layer implemented in various libraries. There From 40b799b18c34ad7d323212e24ae861aa9319ec0a Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Wed, 17 Jun 2020 09:37:37 -0700 Subject: [PATCH 05/10] Update 20200616-keras-multihead-attention.md --- rfcs/20200616-keras-multihead-attention.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index e67b8852e..01316019f 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -311,7 +311,11 @@ No dependencies. * We will first introduce the layer as `tf.keras.layers.experimental.MultiHeadAttention` and - `tf.keras.layers.experimental.EinsumDense`. + `tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and + functionalities are fully verified, the next step is to + graduate as core keras layers by removing `experimental` scope. + + ## Detailed Design From 33b20f175aed7b419396426195a14be06c0d3505 Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Wed, 17 Jun 2020 23:56:44 -0700 Subject: [PATCH 06/10] Use the mask_tensor inside the example code. --- rfcs/20200616-keras-multihead-attention.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index 01316019f..fff5afa57 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -134,7 +134,7 @@ class MultiHeadAttention(tf.keras.layers.Layer): >>> target = tf.keras.Input(shape=[8, 16]) >>> source = tf.keras.Input(shape=[4, 16]) >>> mask_tensor = tf.keras.Input(shape=[8, 4]) - >>> output_tensor, weights = layer([target, source]) + >>> output_tensor, weights = layer([target, source], attention_mask=mask_tensor) >>> print(output_tensor.shape), print(weights.shape) (None, 8, 16) (None, 2, 8, 4) From b3daa1d2b0d4a642f6f28f7f42ec6aa9ca35c32f Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Fri, 26 Jun 2020 11:35:03 -0700 Subject: [PATCH 07/10] Update 20200616-keras-multihead-attention.md --- rfcs/20200616-keras-multihead-attention.md | 59 +++++++++++----------- 1 file changed, 29 insertions(+), 30 deletions(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index fff5afa57..9a1907079 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -3,7 +3,7 @@ | Status | (Proposed / Accepted / Implemented / Obsolete) | | :------------ | :------------------------------------------------------ | | **RFC #** | [260](https://github.com/tensorflow/community/pull/260) | -| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) | +| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) | | **Sponsor** | Francois Chollet (fchollet@google.com) | | **Updated** | 2020-06-16 | @@ -83,7 +83,7 @@ test_layer = MultiHeadAttention( num_heads=2, key_size=2, 
return_attention_scores=True) target = np.array([[[0.1, 0.2], [0.0, 0.0]]]) source = np.array([[[0.1, 0.2], [3.0, 1.0]]]) -output, scores = test_layer([target, source]) +output, scores = test_layer(query=target, value=source) scores = tf.math.reduce_sum(scores, axis=1) # shape = (1, 2, 2) ``` @@ -96,7 +96,7 @@ mask_shape = [2, 3, 4, 3, 2] query = 10 * np.random.random_sample(query_shape) value = 10 * np.random.random_sample(value_shape) mask_data = np.random.randint(2, size=mask_shape).astype("bool") -output = test_layer([query, value], mask_data) +output = test_layer(query=query, value=value, attention_mask=mask_data) ``` ### Interface @@ -134,7 +134,8 @@ class MultiHeadAttention(tf.keras.layers.Layer): >>> target = tf.keras.Input(shape=[8, 16]) >>> source = tf.keras.Input(shape=[4, 16]) >>> mask_tensor = tf.keras.Input(shape=[8, 4]) - >>> output_tensor, weights = layer([target, source], attention_mask=mask_tensor) + >>> output_tensor, weights = layer(query=target, value=source + ... attention_mask=mask_tensor) >>> print(output_tensor.shape), print(weights.shape) (None, 8, 16) (None, 2, 8, 4) @@ -142,7 +143,7 @@ class MultiHeadAttention(tf.keras.layers.Layer): >>> layer = MultiHeadAttention(num_heads=2, key_size=2, attention_axes=(2, 3)) >>> input_tensor = tf.keras.Input(shape=[5, 3, 4, 16]) - >>> output_tensor = layer([input_tensor, input_tensor]) + >>> output_tensor = layer(query=input_tensor, value=input_tensor) >>> print(output_tensor.shape) (None, 5, 3, 4, 16) @@ -167,7 +168,7 @@ class MultiHeadAttention(tf.keras.layers.Layer): bias_constraint: Constraint for dense layer kernels. """ - def call(self, inputs, attention_mask=None): + def call(self, query, value, key=None, attention_mask=None): """Implements the forward pass. Size glossary: @@ -180,10 +181,9 @@ class MultiHeadAttention(tf.keras.layers.Layer): * Value (source) attention axes shape (S), the rank must match the target. Args: - inputs: List of the following tensors: - * query: Query `Tensor` of shape `[B, T, dim]`. - * value: Value `Tensor` of shape `[B, S, dim]`. - * key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will + query: Query `Tensor` of shape `[B, T, dim]`. + value: Value `Tensor` of shape `[B, S, dim]`. + key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will use `value` for both `key` and `value`, which is the most common case. attention_mask: a boolean mask of shape `[B, T, S]`, that prevents attention to certain positions. @@ -242,14 +242,15 @@ we would like to introduce an optional argument `attention_mask` for the shape is (batch_size, target_length, source_length). Whenever `attention_mask` is specified, the `mask` argument is OK to be skipped. -* TFA `MultiHeadAttention` Deprecation and Re-mapping - -[MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py) has been released. The proposed `MultiHeadAttention` has similar `__init__` arguments -and `call` interface, where the minor differences are argument names and the attention `mask` shape. -We expect the new `MultiHeadAttention` keras layer will -cover the functionalities. Once the implementation are merged as experimental layers, -we will work with TF Addons team to design the deprecation and re-mapping procedure. +* TFA `MultiHeadAttention` Deprecation and Re-mapping +[MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py) +has been released. 
The proposed `MultiHeadAttention` has similar `__init__` +arguments and `call` interface, where the minor differences are argument names +and the attention `mask` shape. We expect the new `MultiHeadAttention` keras +layer will cover the functionalities. Once the implementation are merged as +experimental layers, we will work with TF Addons team to design the deprecation +and re-mapping procedure. ### Alternatives Considered @@ -307,15 +308,13 @@ No dependencies. and serializable to SavedModel. These are tested inside TensorFlow Model Garden applications. -### User Impact +### User Impacteisum * We will first introduce the layer as `tf.keras.layers.experimental.MultiHeadAttention` and `tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and - functionalities are fully verified, the next step is to - graduate as core keras layers by removing `experimental` scope. - - + functionalities are fully verified, the next step is to graduate as core + keras layers by removing `experimental` scope. ## Detailed Design @@ -323,17 +322,17 @@ The layer has been implemented as the [MultiHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention.py#L116) inside TensorFlow Model Garden. -First, as we rely on `tf.eisum` to define projections and attention computation, -we need to figure out the einsum notation of each computation. Furthermore, to -make the layer generalize to high-dimension cases, i.e. there are more than one -batch dimensions and attention softmax can be performed on multiple axes, we -need to track the batch axes and attention axes inside einsum notations. We use -a vector of chars and use two local methods to generate einsum notations for -projections and attentions. +First, as we rely on `tf.einsum` to define projections and attention +computation, we need to figure out the einsum notation of each computation. +Furthermore, to make the layer generalize to high-dimension cases, i.e. there +are more than one batch dimensions and attention softmax can be performed on +multiple axes, we need to track the batch axes and attention axes inside einsum +notations. We use a vector of chars and use two local methods to generate einsum +notations for projections and attentions. Second, the layer by default implements the most common dot-product attention. There are various ways to implement the attention computation, so we modulize it -as two methods `_build_attention` and `_compute_attention`. Thus, users may be +as two methods `build_attention` and `compute_attention`. Thus, users will be able to just override them to get a new keras layer with a novel attention method. 
For example, we implemented [TalkingHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) From 4ffd12793dfad3db3335a614490a63e236fc5370 Mon Sep 17 00:00:00 2001 From: ematejska Date: Tue, 7 Jul 2020 17:03:51 -0700 Subject: [PATCH 08/10] Update 20200616-keras-multihead-attention.md --- rfcs/20200616-keras-multihead-attention.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index 9a1907079..316010208 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -1,6 +1,6 @@ # RFC: Multihead Attention and EinsumDense on Keras -| Status | (Proposed / Accepted / Implemented / Obsolete) | +| Status | Proposed | | :------------ | :------------------------------------------------------ | | **RFC #** | [260](https://github.com/tensorflow/community/pull/260) | | **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) | From 9778f63efcf79be7570cf5811193240b012035d7 Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Mon, 20 Jul 2020 12:26:13 -0700 Subject: [PATCH 09/10] Update two proposed changes Update two proposed changes to the existing Attention layer --- rfcs/20200616-keras-multihead-attention.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index 316010208..26cf75b15 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -231,8 +231,9 @@ expension logic and multi-axes softmax will be handled locally in * Keras Dense Attention -[tf.keras.layers.Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) -layer call method takes an optional argument, `mask`, which requires two +We have two changes proposed to +[tf.keras.layers.Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention). +(1) The layer call method takes an optional argument, `mask`, which requires two tensors, `q_mask` and `v_mask`. They are following keras framework requirements with (batch_size, target_length) and (batch_size, source_length) as shapes. This limits the flexibility of masking and `MultiHeadAttention` layer generalize the @@ -241,6 +242,9 @@ we would like to introduce an optional argument `attention_mask` for `tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`, the shape is (batch_size, target_length, source_length). Whenever `attention_mask` is specified, the `mask` argument is OK to be skipped. +(2) The layer does not return attention scores. We will add the bool argument, +`return_attention_scores` to the __init__ and return the attention score tensor if +it is true. 
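For illustration, if both proposed changes are adopted as described, usage of `tf.keras.layers.Attention` could look roughly like the sketch below; the argument names and their placement follow this proposal and are not part of the current API.

```python
import tensorflow as tf

# Hypothetical usage: assumes the proposed `return_attention_scores` __init__
# argument and the generalized `attention_mask` call argument are both adopted.
layer = tf.keras.layers.Attention(return_attention_scores=True)

query = tf.keras.Input(shape=[8, 16])
value = tf.keras.Input(shape=[4, 16])
# Proposed mask shape for the reduced case: (batch, target_length, source_length).
attention_mask = tf.keras.Input(shape=[8, 4], dtype=tf.bool)

output, scores = layer([query, value], attention_mask=attention_mask)
# output: (None, 8, 16); scores: (None, 8, 4)
```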
* TFA `MultiHeadAttention` Deprecation and Re-mapping From 72c0662d1b161e1e5f0167c0fcd135672a4de4f2 Mon Sep 17 00:00:00 2001 From: Hongkun Yu Date: Mon, 20 Jul 2020 12:27:57 -0700 Subject: [PATCH 10/10] Update to accepted --- rfcs/20200616-keras-multihead-attention.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md index 26cf75b15..c930fbfe1 100644 --- a/rfcs/20200616-keras-multihead-attention.md +++ b/rfcs/20200616-keras-multihead-attention.md @@ -1,6 +1,6 @@ # RFC: Multihead Attention and EinsumDense on Keras -| Status | Proposed | +| Status | Accepted | | :------------ | :------------------------------------------------------ | | **RFC #** | [260](https://github.com/tensorflow/community/pull/260) | | **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) | @@ -342,8 +342,8 @@ method. For example, we implemented [TalkingHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) introduced by ["Talking-Heads Attention "](https://arxiv.org/abs/2003.02436) paper. Using the keras Attention layer as another example, since it supports the -basic single-head case 1-D attention, we can use it inside `_build_attention` -and `_compute_attention`. +basic single-head case 1-D attention, we can use it inside `build_attention` +and `compute_attention`. ## Questions and Discussion Topics
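To illustrate the extension point that `build_attention` and `compute_attention` are meant to provide, a variant layer could override the attention computation roughly as sketched below. The method signature and return values are assumptions for illustration; the authoritative hooks are defined by the Model Garden implementation referenced above. The sketch swaps the scaled dot-product scores for cosine similarities as an example of a novel attention method.

```python
import tensorflow as tf

class CosineMultiHeadAttention(MultiHeadAttention):  # proposed layer, assumed in scope
  """Sketch: replaces scaled dot-product scores with cosine similarities."""

  def compute_attention(self, query, key, value, attention_mask=None):
    # Assumed per-head shapes for the 1-D case:
    #   query: [batch, target, heads, size]; key/value: [batch, source, heads, size].
    query = tf.math.l2_normalize(query, axis=-1)
    key = tf.math.l2_normalize(key, axis=-1)
    scores = tf.einsum("btnh,bsnh->bnts", query, key)  # [batch, heads, target, source]
    if attention_mask is not None:
      mask = tf.cast(attention_mask[:, tf.newaxis, :, :], scores.dtype)
      scores += (1.0 - mask) * -1e9
    probs = tf.nn.softmax(scores, axis=-1)
    attention_output = tf.einsum("bnts,bsnh->btnh", probs, value)
    return attention_output, probs
```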