RFC: Keras categorical inputs #188


Closed
wants to merge 5 commits into from

Conversation

tanzhenyu
Contributor

@tanzhenyu tanzhenyu commented Dec 13, 2019

Comment period is open till Dec 31, 2019.

Keras categorical inputs

Status: Proposed
RFC #: 188
Author(s): Zhenyu Tan ([email protected]), Francois Chollet ([email protected])
Sponsor: Karmel Allison ([email protected]), Martin Wicke ([email protected])
Updated: 2019-12-12

Objective

This document proposes 4 new preprocessing Keras layers (CategoryLookup, CategoryCrossing, CategoryEncoding, CategoryHashing), and 1 additional op (to_sparse) to allow users to:

  • Perform feature engineering for categorical inputs
  • Replace feature columns and tf.keras.layers.DenseFeatures with proposed layers
  • Introduce sparse inputs that work with Keras linear models and other layers that support sparsity
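As a framework-free illustration of what a lookup-plus-encoding pipeline does (function names and the OOV convention here are illustrative, not the proposed API):

```python
def category_lookup(values, vocab, num_oov_tokens=1):
    """Map raw categorical values to integer indices; unknown values fall into
    one of the reserved OOV slots [0, num_oov_tokens)."""
    index = {v: i + num_oov_tokens for i, v in enumerate(vocab)}
    return [index.get(v, hash(v) % num_oov_tokens) for v in values]

def category_encode(indices, num_categories):
    """Count-encode integer indices into a dense vector (multi-hot with counts)."""
    out = [0] * num_categories
    for i in indices:
        out[i] += 1
    return out

idx = category_lookup(["cat", "dog", "bird"], vocab=["cat", "dog"])  # → [1, 2, 0]
category_encode(idx, num_categories=3)  # → [1, 1, 1]
```

The proposed layers perform these steps on (possibly sparse) tensors instead of Python lists.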


```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossEntropy(from_logits=True), metrics=['accuracy'])

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
```
Member

It seems that this is referenced before assignment? Does this code run?

Contributor Author

Fix applied.

Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.Lambda(lambda x: tf.where(tf.logical_or(x < 0, x > num_buckets), tf.fill(dims=tf.shape(x), value=default_value), x))
```
Member

If we allowed to specify the hash function, this could also be folded into the CategoryHashing with an IdentityHash.

Contributor Author

CategoryHashing does not check OOV values, so adding that would complicate the signature.


Why not use a dedicated layer instead of the Lambda layer?


@brightcoder01 brightcoder01 Mar 17, 2020


How about adding a layer to do the work of categorical_column_with_identity?

  1. The lambda expression is a little complicated
  2. We may need the layer to handle both dense tensors and SparseTensors.
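The categorical_column_with_identity behavior under discussion reduces to a range check; a plain-Python sketch (the function name is illustrative, and the intended valid range is assumed to be [0, num_buckets)):

```python
def identity_category(ids, num_buckets, default_value):
    """Pass ids through unchanged when in [0, num_buckets); otherwise substitute default_value."""
    return [x if 0 <= x < num_buckets else default_value for x in ids]

identity_category([-1, 0, 3, 7], num_buckets=4, default_value=0)  # → [0, 0, 3, 0]
```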

Member

@martinwicke martinwicke left a comment


Can you add (or not delete) the other questions in the template? The API and workflow section is very good.

@tanzhenyu
Contributor Author

Can you add (or not delete) the other questions in the template? The API and workflow section is very good.

Done.

@brijk7 brijk7 added the RFC: Proposed RFC Design Document label Dec 13, 2019
@brijk7 brijk7 changed the title RFC: Keras categorical input. RFC: Keras categorical inputs Dec 13, 2019
Contributor

@ebrevdo ebrevdo left a comment


`to_sparse` is too global. I'd move it somewhere under Keras and name it something more specific.

TF2 has several notions of sparsity including SparseTensor, SparseMatrix (coming), and possibly others in the future.


```python
`tf.keras.layers.CategoryLookup`
CategoryLookup(PreprocessingLayer):
```
Contributor

Please don't forget to implement the correct compute_output_signature for these classes, since they will accept SparseTensorSpecs, and must emit SparseTensorSpecs in this case.

Contributor Author

Thanks for the reminder!


Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR#scrollTo=22sa0D19kxXY).

### Workflow 1
Contributor

Would it make sense to provide an example workflow where you have to get the vocabulary from e.g. a CSV file using tf.data, in particular tf.data.experimental.unique and tf.data.experimental.get_single_element to read out the tensor? @jsimsa wdyt?

This is going to be very common, I think, in real use cases.
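Logically, deriving the vocabulary from a CSV column is a streaming unique over that column; a plain-Python sketch of the same idea (a tf.data version would instead stream the column through tf.data.experimental.unique):

```python
import csv
import io

def vocab_from_csv(csv_text, column):
    """Sorted unique values of one column, i.e. the vocabulary a lookup layer needs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[column] for row in reader})

data = "sex,age\nmale,22\nfemale,38\nfemale,26\n"
vocab_from_csv(data, "sex")  # → ['female', 'male']
```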

Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.CategoryLookup(
```
Contributor

Ah I see, so here you allow the user to provide a vocabulary file directly, so the tf.data example may not be necessary. May still be useful if users have vocab that needs to be munged a bit before reading directly. But less important.

Contributor Author

I think this layer is complementary to that, i.e., tf.data can parse records and generate a vocab file, or read a vocab file, do other processing, and still return string tensors. This layer takes that output and converts it to indices before it reaches the embedding.

Comment on lines 209 to 211
vocabulary: the vocabulary to lookup the input. If it is a file, it represents the
source vocab file; If it is a list/tuple, it represents the source vocab
list; If it is None, the vocabulary can later be set.
Contributor

What is the format of the file? How do you set the vocabulary later? What is the expected use of the adapt method?

Contributor Author

  1. The format of the file is the same as
    a) any other TFX vocab file, or
    b) this test file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/testdata/warriors_vocabulary.txt

  2. Users from the feature-column world will set it during init, but this layer also allows users to call `adapt` to derive/set the vocabulary from a dataset.
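A minimal sketch of the two initialization paths described above (the class and method names are illustrative, not the proposed API):

```python
class Lookup:
    """Vocabulary may be supplied at __init__ (feature-column style) or derived later via adapt()."""
    def __init__(self, vocabulary=None):
        self.vocabulary = list(vocabulary) if vocabulary is not None else None

    def adapt(self, dataset):
        # Derive the vocabulary from data, mirroring the proposed adapt() workflow.
        self.vocabulary = sorted(set(dataset))

    def __call__(self, values):
        index = {v: i for i, v in enumerate(self.vocabulary)}
        return [index.get(v, -1) for v in values]  # -1 marks out-of-vocabulary

layer = Lookup()
layer.adapt(["b", "a", "b"])
layer(["a", "b", "z"])  # → [0, 1, -1]
```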

Contributor

My point is that this should be documented. Stating that the vocabulary can be set later without showing how is not useful.

Contributor Author

Done.

```python
`tf.keras.layers.CategoryCrossing`
CategoryCrossing(PreprocessingLayer):
"""This layer transforms multiple categorical inputs to categorical outputs
by Cartesian product. and hash the output if necessary.
```
Contributor

nit: remove extra .

Contributor Author

Done.

```python
CategoryCrossing(PreprocessingLayer):
"""This layer transforms multiple categorical inputs to categorical outputs
by Cartesian product. and hash the output if necessary.
If any input is sparse, then output is sparse, otherwise dense."""
```
Contributor

OOC, why is the wording here different than in the other API endpoints (it seems that the intended behavior is the same?)

Contributor Author

Good question. This is the only layer that can accept multiple inputs; the other APIs only accept a single Tensor/SparseTensor.
So with multiple inputs, if any one of them is sparse, the output will be sparse.

Contributor

Got it. Maybe you should say, "If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense."

Contributor Author

Yeah that's better. Done.

Comment on lines +237 to +238
combined into all combinations of output with degree of `depth`. For example,
with inputs `a`, `b` and `c`, `depth=2` means the output will be [ab;ac;bc]
Contributor

the example should be moved to the "Example" section below

Contributor Author

It is in both places: an Example in each layer description, and a code snippet below.

Comment on lines 248 to 254
If the layer receives two inputs, `a=[[1, 2]]` and `b=[[1, 3]]`,
and if depth is 2, then
the output will be a single integer tensor `[[i, j, k, l]]`, where:
i is the index of the category "a1=1 and b1=1"
j is the index of the category "a1=1 and b2=3"
k is the index of the category "a2=2 and b1=1"
l is the index of the category "a2=2 and b2=3"
Contributor

I don't understand this example. What is a1 vs a2? What will the "single integer tensor [[i, j, k, l]]" actually look like for the given inputs?

Contributor Author

Yeah it is confusing. Updated.
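For concreteness, the crossing semantics can be sketched in plain Python (the function name and `_X_` separator are illustrative): each degree-`depth` combination of input features is expanded by the Cartesian product of its values, which reproduces both the `[ab;ac;bc]` case and the four categories i..l above:

```python
from itertools import combinations, product

def cross(features, depth, sep="_X_"):
    """All degree-`depth` feature combinations, each expanded by Cartesian product of its values."""
    out = []
    for combo in combinations(features, depth):
        out.extend(sep.join(str(v) for v in vals) for vals in product(*combo))
    return out

cross([[1, 2], [1, 3]], depth=2)       # → ['1_X_1', '1_X_3', '2_X_1', '2_X_3']
cross([["a"], ["b"], ["c"]], depth=2)  # → ['a_X_b', 'a_X_c', 'b_X_c']
```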

```python
pass

`tf.keras.layers.CategoryEncoding`
CategoryEncoding(PreprocessingLayer):
```
Contributor

Please add example for this layer.

Contributor Author

Done.

```python
pass

`tf.keras.layers.CategoryHashing`
CategoryHashing(PreprocessingLayer):
```
Contributor

Please add example for this layer.

Contributor Author

Done.
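For reference, the bucketing behavior can be sketched in plain Python; note that TensorFlow's hashing ops use a different hash (FarmHash), so `crc32` here is purely illustrative:

```python
import zlib

def category_hash(values, num_bins):
    """Deterministically bucket categorical values into [0, num_bins)."""
    return [zlib.crc32(str(v).encode()) % num_bins for v in values]

buckets = category_hash(["cat", "dog", "cat"], num_bins=8)
# Equal inputs always land in the same bucket; every bucket lies in [0, 8).
```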



We also propose a `to_sparse` op to convert dense tensors to sparse tensors given user-specified ignore values. This op can be used in both `tf.data` and [TF Transform](https://www.tensorflow.org/tfx/transform/get_started). In the feature-column world, "" is ignored for dense string input and -1 is ignored for dense int input.
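The proposed op's semantics can be sketched in plain Python over a COO-style (indices, values, dense_shape) triple; the helper name and triple representation are illustrative, not the proposed API:

```python
def dense_to_sparse(dense, ignore_value):
    """Convert a 2-D list to COO form, dropping entries equal to ignore_value ('' or -1, say)."""
    indices, values = [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != ignore_value:
                indices.append((i, j))
                values.append(v)
    return indices, values, (len(dense), len(dense[0]))

dense_to_sparse([["A", ""], ["", "C"]], ignore_value="")
# → ([(0, 0), (1, 1)], ['A', 'C'], (2, 2))
```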
Contributor

I would prefer if the SparseTensor class had a from_dense method.

Contributor Author

If we don't need the functionality of sparse_output = to_sparse(sparse_input), then from_dense is probably better.
This "imagined" functionality is not used anywhere, though. In TFT, I think any tf.io.VarLenFeature should automatically be sparse input; we just need to call SparseTensor.from_dense for any tf.io.FixedLenFeature.

WDYT?

Contributor

I realized we already have from_dense, so perhaps you should just extend it with an option to set the element to be ignored?

Contributor Author

Yeah good point. I wasn't aware of this op. We should just extend it. Done.

```python
`tf.to_sparse`
def to_sparse(input, ignore_value):
"""Convert dense/sparse tensor to sparse while dropping user specified values.
```
Contributor

What is the benefit of calling this API with a SparseTensor input?

Contributor Author

To allow users to filter out specified values. E.g., if the original input is already sparse:
```python
indices = [[0, 0], [1, 0], [1, 1]]
values = ['A', '', 'C']
```
the user can still filter '' from it.

Contributor

This filtering can be built out of existing operations: you can call tf.where on the values and pass the result to tf.sparse.retain, which is simple enough that I do not see the point of introducing syntactic sugar for it.
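As a plain-Python sketch of that composition (representing the SparseTensor as parallel indices/values lists; names illustrative), the retain step is just masking:

```python
def sparse_retain(indices, values, to_retain):
    """Keep only the entries whose mask is True (the semantics of tf.sparse.retain)."""
    kept = [(idx, v) for idx, v, keep in zip(indices, values, to_retain) if keep]
    return [idx for idx, _ in kept], [v for _, v in kept]

indices = [(0, 0), (1, 0), (1, 1)]
values = ["A", "", "C"]
mask = [v != "" for v in values]      # the tf.where / elementwise-compare step
sparse_retain(indices, values, mask)  # → ([(0, 0), (1, 1)], ['A', 'C'])
```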

Contributor Author

Makes sense. Let's just extend the tf.sparse.from_dense op.

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 20, 2019
tensorflow/community#188

PiperOrigin-RevId: 286486505
Change-Id: I0fa15cb157076f86fd662215aedda6d5761d915d
@tanzhenyu
Contributor Author

@tanzhenyu were there notes from the design review meeting that could be posted here?

Yeah I will update it soon.


@haifeng-jin haifeng-jin left a comment


@tanzhenyu The layers look great!

From the AutoKeras perspective, I would like a workflow that lets the user feed the CSV data they load into a single input node instead of many:
hide the single-column tensors inside the Keras Model, and
use one "decompose" layer after the input node to split the CSV data into single-column tensors that feed the categorical layers.

Is this possible?

```python
vocab_list = sorted(dftrain[feature_name].unique())
# Map string values to indices
x = tf.keras.layers.Lookup(vocabulary=vocab_list, name=feature_name)(feature_input)
x = tf.keras.layers.Vectorize(num_categories=len(vocab_list))(x)
```


incorrect indent.


```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossEntropy(from_logits=True), metrics=['accuracy'])

dataset = tf.data.Dataset.from_tensor_slices((
```


Is it possible to have a single input node and use some layer to decompose the input into these single-column nodes?

Contributor Author

Use tf.split for that?

@ematejska ematejska requested review from ematejska and removed request for brijk7 February 24, 2020 17:20
@googlebot

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)


@ematejska

The CLA looks good. tanzhenyu signed the CLA but some of the commits were from his laptop account.

@tanzhenyu
Contributor Author

The CLA looks good. tanzhenyu signed the CLA but some of the commits were from his laptop account.

Yeah I created #209 as a workaround -- which one should we merge here?

@ematejska

Let's continue in this one since it has the history and get this one merged. Also, do you have notes from the review meeting you could post here?

@tanzhenyu
Contributor Author

Let's continue in this one since it has the history and get this one merged. Also, do you have notes from the review meeting you could post here?

It doesn't seem I can update this once the original branch is gone. What can we do to fix this?

@workingloong

With tf.feature_column, we can use embedding_column to wrap a category_column and convert the category_column output to a dense tensor. Can the Keras category layers support functionality like embedding_column?
Meanwhile, tf.keras.layers.Embedding cannot support SparseTensor inputs, which may be the output of CategoryLookup. I have created issue tensorflow/tensorflow#33880 for embedding lookup with SparseTensor.

You can use dense tensor input to CategoryLookup, which gives you dense tensor output, and feed that into tf.keras.layers.Embedding.

Maybe we should support sparse input in embedding layer.

In the RFC, the preprocessing layers like Lookup and FingerPrint support SparseTensor in and out. But right now we cannot feed the output SparseTensor into tf.keras.layers.Embedding. Will embedding with sparse input be released with those preprocessing layers?

@tanzhenyu
Contributor Author

With tf.feature_column, we can use embedding_column to wrap a category_column and convert the category_column output to a dense tensor. Can the Keras category layers support functionality like embedding_column?
Meanwhile, tf.keras.layers.Embedding cannot support SparseTensor inputs, which may be the output of CategoryLookup. I have created issue tensorflow/tensorflow#33880 for embedding lookup with SparseTensor.

You can use dense tensor input to CategoryLookup, which gives you dense tensor output, and feed that into tf.keras.layers.Embedding.
Maybe we should support sparse input in embedding layer.

In the RFC, the preprocessing layers like Lookup and FingerPrint support SparseTensor in and out. But right now we cannot feed the output SparseTensor into tf.keras.layers.Embedding. Will embedding with sparse input be released with those preprocessing layers?

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

@workingloong

workingloong commented Mar 17, 2020

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

Yes. Besides Lookup, Hashing and Bucketize, which support sparse in and sparse out, are also use cases. In ElasticDL, we are developing some preprocessing layers like RoundIdentity. What's more, we sometimes need to convert the inputs to a sparse tensor, ignoring the missing value, and then transform the sparse tensor using Keras preprocessing layers. The details are in the ElasticDL issue.

In TF 2.1.0, I find that tf.keras.layers.Embedding supports RaggedTensor and outputs a RaggedTensor of embedding vectors. In my opinion, the function of a RaggedTensor is similar to a SparseTensor, and we can use tf.keras.layers.Embedding with a ragged tensor plus tf.reduce operators like tf.reduce_sum to implement the feature of tf.nn.safe_embedding_lookup_sparse with a combiner. Besides tf.keras.layers.Embedding, tf.keras.layers.Concatenate also supports RaggedTensor but not SparseTensor. What do you think about the difference and relationship between SparseTensor and RaggedTensor with respect to embedding?

@tanzhenyu
Contributor Author

That's a very good question. We're currently gathering use cases for supporting sparse with Embedding layer, Lookup is definitely the most important one, but if you have other use cases, would you mind sharing with us?

Yes. Besides Lookup, Hashing and Bucketize, which support sparse in and sparse out, are also use cases. In ElasticDL, we are developing some preprocessing layers like RoundIdentity. What's more, we sometimes need to convert the inputs to a sparse tensor, ignoring the missing value, and then transform the sparse tensor using Keras preprocessing layers. The details are in the ElasticDL issue.

In TF 2.1.0, I find that tf.keras.layers.Embedding supports RaggedTensor and outputs a RaggedTensor of embedding vectors. In my opinion, the function of a RaggedTensor is similar to a SparseTensor, and we can use tf.keras.layers.Embedding with a ragged tensor plus tf.reduce operators like tf.reduce_sum to implement the feature of tf.nn.safe_embedding_lookup_sparse with a combiner. Besides tf.keras.layers.Embedding, tf.keras.layers.Concatenate also supports RaggedTensor but not SparseTensor. What do you think about the difference and relationship between SparseTensor and RaggedTensor with respect to embedding?

Sorry for the delay.

  1. Yep, I think we should support sparse inputs for the Embedding layer.
  2. The major difference between SparseTensor and RaggedTensor is that the former can represent data with missing values, i.e.:
    [[0, N/A, 1]
     [2, 1, N/A]]
    which is usually used for structured data, while the latter cannot do that, but instead represents data with variable length, which is usually used for text.
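To make the distinction concrete, a plain-Python sketch (the COO triple representation is illustrative): sparse coordinates can leave a hole anywhere in a row, while a ragged layout only varies row length:

```python
def sparse_to_dense(indices, values, shape, fill=None):
    """Materialize COO data; positions absent from `indices` are genuinely missing (N/A)."""
    dense = [[fill] * shape[1] for _ in range(shape[0])]
    for (i, j), v in zip(indices, values):
        dense[i][j] = v
    return dense

# The N/A-in-the-middle example from above:
sparse_to_dense([(0, 0), (0, 2), (1, 0), (1, 1)], [0, 1, 2, 1], (2, 3))
# → [[0, None, 1], [2, 1, None]]

# A ragged layout, by contrast, only encodes variable row lengths; values are left-packed:
ragged = [[0, 1], [2, 1, 7]]
```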

@workingloong

2. RaggedTensor

Thanks for your explanation.
Please notify us if there is a design or PR to support sparse inputs for the Embedding layer; I would like to contribute to it.

@tanzhenyu
Contributor Author

  1. RaggedTensor

Thanks for your explanation.
Please notify us if there is a design or PR to support sparse inputs for the Embedding layer; I would like to contribute to it.

Of course, contribution is welcome! Can you make a PR for it?
Do note that using tf.nn.safe_embedding_lookup_sparse will probably be better than tf.nn.embedding_lookup_sparse. I have tried this when the input shape is (None, None): the former returns shape (None, None, embedding_dimension) while the latter returns (None, None).

@ematejska

Looks like we cannot update the pull request here anymore because the original branch/repo is not available. We will do the merge and approval in #209 and close this one. Please see #209 for the notes from the design review in the RFC doc.

@ematejska ematejska closed this Apr 10, 2020
@ematejska ematejska removed the RFC: Proposed RFC Design Document label Apr 11, 2020