RFC: Keras categorical inputs #188


Closed
wants to merge 5 commits into from

Conversation

tanzhenyu
Contributor

@tanzhenyu tanzhenyu commented Dec 13, 2019

Comment period is open till Dec 31, 2019.

Keras categorical inputs

Status: Proposed
RFC #: 188
Author(s): Zhenyu Tan ([email protected]), Francois Chollet ([email protected])
Sponsor: Karmel Allison ([email protected]), Martin Wicke ([email protected])
Updated: 2019-12-12

Objective

This document proposes 4 new preprocessing Keras layers (CategoryLookup, CategoryCrossing, CategoryEncoding, CategoryHashing), and 1 additional op (to_sparse) to allow users to:

  • Perform feature engineering for categorical inputs
  • Replace feature columns and tf.keras.layers.DenseFeatures with proposed layers
  • Introduce sparse inputs that work with Keras linear models and other layers that support sparsity
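As a framework-free illustration of what a lookup-plus-encoding pipeline does (function names and the OOV convention here are illustrative, not the proposed API):

```python
def category_lookup(values, vocab, num_oov_tokens=1):
    """Map raw categorical values to integer indices; unknown values fall into
    one of the reserved OOV slots [0, num_oov_tokens)."""
    index = {v: i + num_oov_tokens for i, v in enumerate(vocab)}
    return [index.get(v, hash(v) % num_oov_tokens) for v in values]

def category_encode(indices, num_categories):
    """Count-encode integer indices into a dense vector (multi-hot with counts)."""
    out = [0] * num_categories
    for i in indices:
        out[i] += 1
    return out

idx = category_lookup(["cat", "dog", "bird"], vocab=["cat", "dog"])  # → [1, 2, 0]
category_encode(idx, num_categories=3)  # → [1, 1, 1]
```

The proposed layers perform these steps on (possibly sparse) tensors instead of Python lists.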


```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossEntropy(from_logits=True), metrics=['accuracy'])

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
```
Member

It seems that this is referenced before assignment? Does this code run?

Contributor Author

Fix applied.

Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.Lambda(lambda x: tf.where(tf.logical_or(x < 0, x > num_buckets), tf.fill(dims=tf.shape(x), value=default_value), x))
```
Member

If we allowed to specify the hash function, this could also be folded into the CategoryHashing with an IdentityHash.

Contributor Author

CategoryHashing does not check OOV values, so adding that would complicate the signature.


Why not use a dedicated layer instead of the Lambda layer?


@brightcoder01 brightcoder01 Mar 17, 2020


How about adding a layer to do the work of categorical_column_with_identity?

  1. The lambda expression is a little complicated
  2. We may need the layer to handle both dense tensors and SparseTensors.
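The categorical_column_with_identity behavior under discussion reduces to a range check; a plain-Python sketch (the function name is illustrative, and the intended valid range is assumed to be [0, num_buckets)):

```python
def identity_category(ids, num_buckets, default_value):
    """Pass ids through unchanged when in [0, num_buckets); otherwise substitute default_value."""
    return [x if 0 <= x < num_buckets else default_value for x in ids]

identity_category([-1, 0, 3, 7], num_buckets=4, default_value=0)  # → [0, 0, 3, 0]
```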

Member

@martinwicke martinwicke left a comment


Can you add (or not delete) the other questions in the template? The API and workflow section is very good.

@tanzhenyu
Contributor Author

Can you add (or not delete) the other questions in the template? The API and workflow section is very good.

Done.

@brijk7 brijk7 added the RFC: Proposed RFC Design Document label Dec 13, 2019
@brijk7 brijk7 changed the title RFC: Keras categorical input. RFC: Keras categorical inputs Dec 13, 2019
Contributor

@ebrevdo ebrevdo left a comment


`to_sparse` is too global. I'd move it somewhere under Keras and name it something more specific.

TF2 has several notions of sparsity including SparseTensor, SparseMatrix (coming), and possibly others in the future.


```python
`tf.keras.layers.CategoryLookup`
CategoryLookup(PreprocessingLayer):
```
Contributor

Please don't forget to implement the correct compute_output_signature for these classes, since they will accept SparseTensorSpecs, and must emit SparseTensorSpecs in this case.

Contributor Author

Thanks for the reminder!


Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR#scrollTo=22sa0D19kxXY).

### Workflow 1
Contributor

Would it make sense to provide an example workflow where you have to get the vocabulary from e.g. a CSV file using tf.data, in particular tf.data.experimental.unique and tf.data.experimental.get_single_element to read out the tensor? @jsimsa wdyt?

This is going to be very common, I think, in real use cases.
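Logically, deriving the vocabulary from a CSV column is a streaming unique over that column; a plain-Python sketch of the same idea (a tf.data version would instead stream the column through tf.data.experimental.unique):

```python
import csv
import io

def vocab_from_csv(csv_text, column):
    """Sorted unique values of one column, i.e. the vocabulary a lookup layer needs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[column] for row in reader})

data = "sex,age\nmale,22\nfemale,38\nfemale,26\n"
vocab_from_csv(data, "sex")  # → ['female', 'male']
```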

Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.CategoryLookup(
```
Contributor

Ah I see, so here you allow the user to provide a vocabulary file directly, so the tf.data example may not be necessary. May still be useful if users have vocab that needs to be munged a bit before reading directly. But less important.

Contributor Author

I think this layer is complementary to that, i.e., tf.data can parse records and generate a vocab file, or read a vocab file, do other processing, and still return string tensors. This layer takes that output and converts it to indices before it reaches the embedding.

Comment on lines 209 to 211
vocabulary: the vocabulary to lookup the input. If it is a file, it represents the
source vocab file; If it is a list/tuple, it represents the source vocab
list; If it is None, the vocabulary can later be set.
Contributor

What is the format of the file? How do you set the vocabulary later? What is the expected use of the adapt method?

Contributor Author

  1. The format of the file is the same as
    a) any other TFX vocab file, or
    b) this test file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/testdata/warriors_vocabulary.txt

  2. Users from the feature-column world will set it during init, but this layer also allows users to call `adapt` to derive/set the vocabulary from a dataset.
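A minimal sketch of the two initialization paths described above (the class and method names are illustrative, not the proposed API):

```python
class Lookup:
    """Vocabulary may be supplied at __init__ (feature-column style) or derived later via adapt()."""
    def __init__(self, vocabulary=None):
        self.vocabulary = list(vocabulary) if vocabulary is not None else None

    def adapt(self, dataset):
        # Derive the vocabulary from data, mirroring the proposed adapt() workflow.
        self.vocabulary = sorted(set(dataset))

    def __call__(self, values):
        index = {v: i for i, v in enumerate(self.vocabulary)}
        return [index.get(v, -1) for v in values]  # -1 marks out-of-vocabulary

layer = Lookup()
layer.adapt(["b", "a", "b"])
layer(["a", "b", "z"])  # → [0, 1, -1]
```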

Contributor

My point is that this should be documented. Stating that the vocabulary can be set later without showing how is not useful.

Contributor Author

Done.

```python
`tf.keras.layers.CategoryCrossing`
CategoryCrossing(PreprocessingLayer):
"""This layer transforms multiple categorical inputs to categorical outputs
by Cartesian product. and hash the output if necessary.
```
Contributor

nit: remove extra .

Contributor Author

Done.

```python
CategoryCrossing(PreprocessingLayer):
"""This layer transforms multiple categorical inputs to categorical outputs
by Cartesian product. and hash the output if necessary.
If any input is sparse, then output is sparse, otherwise dense."""
```
Contributor

OOC, why is the wording here different than in the other API endpoints (it seems that the intended behavior is the same?)

Contributor Author

Good question. This is the only layer that can accept multiple inputs; the other APIs only accept a single Tensor/SparseTensor.
So with multiple inputs, if any one of them is sparse, the output will be sparse.

Contributor

Got it. Maybe you should say, "If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense."

Contributor Author

Yeah that's better. Done.

Comment on lines +237 to +238
combined into all combinations of output with degree of `depth`. For example,
with inputs `a`, `b` and `c`, `depth=2` means the output will be [ab;ac;bc]
Contributor

the example should be moved to the "Example" section below

Contributor Author

It is in both places: an Example in each layer description, and a code snippet below.

Comment on lines 248 to 254
If the layer receives two inputs, `a=[[1, 2]]` and `b=[[1, 3]]`,
and if depth is 2, then
the output will be a single integer tensor `[[i, j, k, l]]`, where:
i is the index of the category "a1=1 and b1=1"
j is the index of the category "a1=1 and b2=3"
k is the index of the category "a2=2 and b1=1"
l is the index of the category "a2=2 and b2=3"
Contributor

I don't understand this example. What is a1 vs a2? What will the "single integer tensor [[i, j, k, l]]" actually look like for the given inputs?

Contributor Author

Yeah it is confusing. Updated.
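For concreteness, the crossing semantics can be sketched in plain Python (the function name and `_X_` separator are illustrative): each degree-`depth` combination of input features is expanded by the Cartesian product of its values, which reproduces both the `[ab;ac;bc]` case and the four categories i..l above:

```python
from itertools import combinations, product

def cross(features, depth, sep="_X_"):
    """All degree-`depth` feature combinations, each expanded by Cartesian product of its values."""
    out = []
    for combo in combinations(features, depth):
        out.extend(sep.join(str(v) for v in vals) for vals in product(*combo))
    return out

cross([[1, 2], [1, 3]], depth=2)       # → ['1_X_1', '1_X_3', '2_X_1', '2_X_3']
cross([["a"], ["b"], ["c"]], depth=2)  # → ['a_X_b', 'a_X_c', 'b_X_c']
```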

```python
pass

`tf.keras.layers.CategoryEncoding`
CategoryEncoding(PreprocessingLayer):
```
Contributor

Please add example for this layer.

Contributor Author

Done.

```python
pass

`tf.keras.layers.CategoryHashing`
CategoryHashing(PreprocessingLayer):
```
Contributor

Please add example for this layer.

Contributor Author

Done.
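For reference, the bucketing behavior can be sketched in plain Python; note that TensorFlow's hashing ops use a different hash (FarmHash), so `crc32` here is purely illustrative:

```python
import zlib

def category_hash(values, num_bins):
    """Deterministically bucket categorical values into [0, num_bins)."""
    return [zlib.crc32(str(v).encode()) % num_bins for v in values]

buckets = category_hash(["cat", "dog", "cat"], num_bins=8)
# Equal inputs always land in the same bucket; every bucket lies in [0, 8).
```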



We also propose a `to_sparse` op to convert dense tensors to sparse tensors given user-specified ignore values. This op can be used in both `tf.data` and [TF Transform](https://www.tensorflow.org/tfx/transform/get_started). In the feature-column world, "" is ignored for dense string input and -1 is ignored for dense int input.
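The proposed op's semantics can be sketched in plain Python over a COO-style (indices, values, dense_shape) triple; the helper name and triple representation are illustrative, not the proposed API:

```python
def dense_to_sparse(dense, ignore_value):
    """Convert a 2-D list to COO form, dropping entries equal to ignore_value ('' or -1, say)."""
    indices, values = [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != ignore_value:
                indices.append((i, j))
                values.append(v)
    return indices, values, (len(dense), len(dense[0]))

dense_to_sparse([["A", ""], ["", "C"]], ignore_value="")
# → ([(0, 0), (1, 1)], ['A', 'C'], (2, 2))
```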
Contributor

I would prefer if the SparseTensor class had a from_dense method.

Contributor Author

If we don't need the functionality of sparse_output = to_sparse(sparse_input), then from_dense is probably better.
This "imagined" functionality is not used anywhere, though. In TFT, I think any tf.io.VarLenFeature should automatically be sparse input; we just need to call SparseTensor.from_dense for any tf.io.FixedLenFeature.

WDYT?

Contributor

I realized we already have from_dense, so perhaps you should just extend it with an option to set the element to be ignored?

Contributor Author

Yeah good point. I wasn't aware of this op. We should just extend it. Done.

```python
`tf.to_sparse`
def to_sparse(input, ignore_value):
"""Convert dense/sparse tensor to sparse while dropping user specified values.
```
Contributor

What is the benefit of calling this API with a SparseTensor input?

Contributor Author

To allow users to filter out specified values. E.g., if the original input is already sparse:
```python
indices = [[0, 0], [1, 0], [1, 1]]
values = ['A', '', 'C']
```
the user can still filter '' from it.

Contributor

This filtering can be built out of existing operations: you can call tf.where on the values and pass the result to tf.sparse.retain, which is simple enough that I do not see the point of introducing syntactic sugar for it.
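As a plain-Python sketch of that composition (representing the SparseTensor as parallel indices/values lists; names illustrative), the retain step is just masking:

```python
def sparse_retain(indices, values, to_retain):
    """Keep only the entries whose mask is True (the semantics of tf.sparse.retain)."""
    kept = [(idx, v) for idx, v, keep in zip(indices, values, to_retain) if keep]
    return [idx for idx, _ in kept], [v for _, v in kept]

indices = [(0, 0), (1, 0), (1, 1)]
values = ["A", "", "C"]
mask = [v != "" for v in values]      # the tf.where / elementwise-compare step
sparse_retain(indices, values, mask)  # → ([(0, 0), (1, 1)], ['A', 'C'])
```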

Contributor Author

Makes sense. Let's just extend the tf.sparse.from_dense op.

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 20, 2019
tensorflow/community#188

PiperOrigin-RevId: 286486505
Change-Id: I0fa15cb157076f86fd662215aedda6d5761d915d
@tanzhenyu
Contributor Author

@tanzhenyu were there notes from the design review meeting that could be posted here?

Yeah I will update it soon.


@haifeng-jin haifeng-jin left a comment


@tanzhenyu The layers look great!

From the AutoKeras perspective, I would like a workflow that lets the user feed the CSV data they load into a single input node instead of many:
hide the single-column tensors inside the Keras Model, and
use one "decompose" layer after the input node to split the CSV data into single-column tensors that feed the categorical layers.

Is this possible?

```python
vocab_list = sorted(dftrain[feature_name].unique())
# Map string values to indices
x = tf.keras.layers.Lookup(vocabulary=vocab_list, name=feature_name)(feature_input)
x = tf.keras.layers.Vectorize(num_categories=len(vocab_list))(x)
```


incorrect indent.


```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossEntropy(from_logits=True), metrics=['accuracy'])

dataset = tf.data.Dataset.from_tensor_slices((
```


Is it possible to have a single input node and use some layer to decompose the input into these single-column nodes?

Contributor Author

Use tf.split for that?

@ematejska ematejska requested review from ematejska and removed request for brijk7 February 24, 2020 17:20
@googlebot

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)


@ematejska

The CLA looks good. tanzhenyu signed the CLA but some of the commits were from his laptop account.

@tanzhenyu
Contributor Author

The CLA looks good. tanzhenyu signed the CLA but some of the commits were from his laptop account.

Yeah I created #209 as a workaround -- which one should we merge here?

@ematejska

Let's continue in this one since it has the history and get this one merged. Also, do you have notes from the review meeting you could post here?

@tanzhenyu
Contributor Author

Let's continue in this one since it has the history and get this one merged. Also, do you have notes from the review meeting you could post here?

It doesn't seem I can update this once the original branch is gone. What can we do to fix this?

@workingloong

With tf.feature_column, we can use embedding_column to wrap a category_column and convert the category_column output to a dense tensor. Can the Keras category layers support functionality like embedding_column?
Meanwhile, tf.keras.layers.Embedding cannot support SparseTensor inputs, which may be the output of CategoryLookup. I have created issue tensorflow/tensorflow#33880 for embedding lookup with SparseTensor.

You can use dense tensor input to CategoryLookup, which gives you dense tensor output, and feed that into tf.keras.layers.Embedding.

Maybe we should support sparse input in embedding layer.

In the RFC, the preprocessing layers like Lookup and FingerPrint support SparseTensor in and out. But right now we cannot feed the output SparseTensor into tf.keras.layers.Embedding. Will embedding with sparse input be released with those preprocessing layers?

@tanzhenyu
Contributor Author

With tf.feature_column, we can use embedding_column to wrap a category_column and convert the category_column output to a dense tensor. Can the Keras category layers support functionality like embedding_column?
Meanwhile, tf.keras.layers.Embedding cannot support SparseTensor inputs, which may be the output of CategoryLookup. I have created issue tensorflow/tensorflow#33880 for embedding lookup with SparseTensor.

You can use dense tensor input to CategoryLookup, which gives you dense tensor output, and feed that into tf.keras.layers.Embedding.
Maybe we should support sparse input in embedding layer.

In the RFC, the preprocessing layers like Lookup and FingerPrint support SparseTensor in and out. But right now we cannot feed the output SparseTensor into tf.keras.layers.Embedding. Will embedding with sparse input be released with those preprocessing layers?

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

@workingloong

workingloong commented Mar 17, 2020

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

Yes. Besides Lookup, Hashing and Bucketize, which support sparse in and sparse out, are also use cases. In ElasticDL, we are developing some preprocessing layers like RoundIdentity. What's more, we sometimes need to convert the inputs to a sparse tensor, ignoring the missing value, and then transform the sparse tensor using Keras preprocessing layers. The details are in the ElasticDL issue.

In TF 2.1.0, I find that tf.keras.layers.Embedding supports RaggedTensor and outputs a RaggedTensor of embedding vectors. In my opinion, the function of a RaggedTensor is similar to a SparseTensor, and we can use tf.keras.layers.Embedding with a ragged tensor plus tf.reduce operators like tf.reduce_sum to implement the feature of tf.nn.safe_embedding_lookup_sparse with a combiner. Besides tf.keras.layers.Embedding, tf.keras.layers.Concatenate also supports RaggedTensor but not SparseTensor. What do you think about the difference and relationship between SparseTensor and RaggedTensor with respect to embedding?

@tanzhenyu
Contributor Author

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

Yes. Besides Lookup, Hashing and Bucketize, which support sparse in and sparse out, are also use cases. In ElasticDL, we are developing some preprocessing layers like RoundIdentity. What's more, we sometimes need to convert the inputs to a sparse tensor, ignoring the missing value, and then transform the sparse tensor using Keras preprocessing layers. The details are in the ElasticDL issue.

In TF 2.1.0, I find that tf.keras.layers.Embedding supports RaggedTensor and outputs a RaggedTensor of embedding vectors. In my opinion, the function of a RaggedTensor is similar to a SparseTensor, and we can use tf.keras.layers.Embedding with a ragged tensor plus tf.reduce operators like tf.reduce_sum to implement the feature of tf.nn.safe_embedding_lookup_sparse with a combiner. Besides tf.keras.layers.Embedding, tf.keras.layers.Concatenate also supports RaggedTensor but not SparseTensor. What do you think about the difference and relationship between SparseTensor and RaggedTensor with respect to embedding?

Sorry for the delay.

  1. Yep, I think we should support sparse inputs for the Embedding layer.
  2. The major difference between SparseTensor and RaggedTensor is that the former can represent data with missing values, i.e.:
    [[0, N/A, 1]
     [2, 1, N/A]]
    which is usually used for structured data, while the latter cannot do that, but instead represents data with variable length, which is usually used for text.
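To make the distinction concrete, a plain-Python sketch (the COO triple representation is illustrative): sparse coordinates can leave a hole anywhere in a row, while a ragged layout only varies row length:

```python
def sparse_to_dense(indices, values, shape, fill=None):
    """Materialize COO data; positions absent from `indices` are genuinely missing (N/A)."""
    dense = [[fill] * shape[1] for _ in range(shape[0])]
    for (i, j), v in zip(indices, values):
        dense[i][j] = v
    return dense

# The N/A-in-the-middle example from above:
sparse_to_dense([(0, 0), (0, 2), (1, 0), (1, 1)], [0, 1, 2, 1], (2, 3))
# → [[0, None, 1], [2, 1, None]]

# A ragged layout, by contrast, only encodes variable row lengths; values are left-packed:
ragged = [[0, 1], [2, 1, 7]]
```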

@workingloong

2. RaggedTensor

Thanks for your explanation.
Please notify us if there is a design or PR to support sparse inputs for the Embedding layer; I would like to contribute to it.

@tanzhenyu
Contributor Author

  1. RaggedTensor

Thanks for your explanation.
Please notify us if there is a design or PR to support sparse inputs for the Embedding layer; I would like to contribute to it.

Of course, contribution is welcome! Can you make a PR for it?
Do note that using tf.nn.safe_embedding_lookup_sparse will probably be better than tf.nn.embedding_lookup_sparse. I have tried this when the input shape is (None, None): the former returns shape (None, None, embedding_dimension) while the latter returns (None, None).

@ematejska

Looks like we cannot update the pull request here anymore because the original branch/repo is not available. We will do the merge and approval in #209 and close this one. Please see #209 for the notes from the design review in the RFC doc.

@ematejska ematejska closed this Apr 10, 2020
@ematejska ematejska removed the RFC: Proposed RFC Design Document label Apr 11, 2020