RFC: Keras categorical inputs #188
Conversation
```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
```
It seems that this is referenced before assignment? Does this code run?
Fix applied.
Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.Lambda(lambda x: tf.where(tf.logical_or(x < 0, x > num_buckets), tf.fill(dims=tf.shape(x), value=default_value), x))
```
If we allowed specifying the hash function, this could also be folded into `CategoryHashing` with an `IdentityHash`.
`CategoryHashing` does not check OOV values, so adding that would complicate the signature.
Why not use a dedicated layer instead of the Lambda layer?
How about adding a layer to do the work of `categorical_column_with_identity`?
- The lambda expression is a little complicated.
- We may need the layer to handle both dense tensors and SparseTensors (see the sketch below).
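For illustration, a minimal sketch of what such a layer could look like; the name `CategoryIdentity` and its arguments are hypothetical and not part of the RFC:

```python
import tensorflow as tf

class CategoryIdentity(tf.keras.layers.Layer):
    """Hypothetical sketch: keep integer ids in [0, num_buckets), replace the rest."""

    def __init__(self, num_buckets, default_value=0, **kwargs):
        super().__init__(**kwargs)
        self.num_buckets = num_buckets
        self.default_value = default_value

    def call(self, inputs):
        def _replace_oov(values):
            oov = tf.logical_or(values < 0, values >= self.num_buckets)
            default = tf.cast(self.default_value, values.dtype)
            return tf.where(oov, tf.fill(tf.shape(values), default), values)

        # Handle both dense tensors and SparseTensors.
        if isinstance(inputs, tf.SparseTensor):
            return tf.SparseTensor(inputs.indices, _replace_oov(inputs.values),
                                   inputs.dense_shape)
        return _replace_oov(inputs)
```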
Can you add (or not delete) the other questions in the template? The API and workflow section is very good.
Done.
`to_sparse` is too global. I'd move it somewhere under Keras and name it something more specific. TF2 has several notions of sparsity, including SparseTensor, SparseMatrix (coming), and possibly others in the future.
`tf.keras.layers.CategoryLookup`
```python
CategoryLookup(PreprocessingLayer):
```
Please don't forget to implement the correct `compute_output_signature` for these classes, since they will accept SparseTensorSpecs and must emit SparseTensorSpecs in this case.
Thanks for the reminder!
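For illustration only, a minimal sketch of a `compute_output_signature` that preserves sparseness as requested above (it assumes the layer emits integer indices; this is not the RFC's actual implementation):

```python
import tensorflow as tf

# Sketch of a method body on the proposed layer class.
def compute_output_signature(self, input_spec):
    # Preserve sparseness: SparseTensorSpec in, SparseTensorSpec out.
    output_shape = self.compute_output_shape(input_spec.shape)
    if isinstance(input_spec, tf.SparseTensorSpec):
        return tf.SparseTensorSpec(shape=output_shape, dtype=tf.int64)
    return tf.TensorSpec(shape=output_shape, dtype=tf.int64)
```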
Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR#scrollTo=22sa0D19kxXY).

### Workflow 1
Would it make sense to provide an example workflow where you have to get the vocabulary from, e.g., a CSV file using tf.data, and in particular tf.data.experimental.unique and tf.data.experimental.get_single_element to read out the tensor? @jsimsa wdyt?
This is going to be very common, I think, in real use cases.
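A rough sketch of that kind of workflow, for illustration only (the file name and column index are made up, and it assumes the vocabulary fits in a single batch):

```python
import tensorflow as tf

# Read one string column of a CSV and derive its vocabulary with tf.data.
column = tf.data.experimental.CsvDataset(
    'train.csv', record_defaults=[tf.string], select_cols=[2], header=True)
column = column.map(lambda value: value)  # unpack 1-tuple elements into scalars
unique_values = column.apply(tf.data.experimental.unique())
# Collect the unique values into a single 1-D string tensor.
vocab = tf.data.experimental.get_single_element(unique_values.batch(1_000_000))
```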
Proposed:
```python
x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
layer = tf.keras.layers.CategoryLookup(
```
Ah I see, so here you allow the user to provide a vocabulary file directly, so the tf.data example may not be necessary. It may still be useful if users have a vocab that needs to be munged a bit before reading it directly, but that's less important.
I think this layer is complementary to that, i.e., tf.data can parse records and generate a vocab file, or read a vocab file, do other processing, and still return string tensors. This layer takes it from there and converts things to indices before they get to the embedding.
vocabulary: the vocabulary to lookup the input. If it is a file, it represents the
source vocab file; If it is a list/tuple, it represents the source vocab
list; If it is None, the vocabulary can later be set.
What is the format of the file? How do you set the vocabulary later? What is the expected use of the `adapt` method?
- The format of the file is the same as a) any other TFX vocab file, or b) this test file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/testdata/warriors_vocabulary.txt
- Users from the feature columns world will set it during init, but this layer also allows users to call `adapt` to derive/set the vocabulary from a dataset.
My point is that this should be documented. Stating that the vocabulary can be set later without showing how is not useful.
Done.
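To make the two usage modes concrete, a minimal sketch of how the vocabulary could be supplied (this uses the proposed, not-yet-released `CategoryLookup` API, so the exact argument and method names are assumptions):

```python
import tensorflow as tf

# Mode 1: pass the vocabulary (file path or list) at construction time.
lookup = tf.keras.layers.CategoryLookup(vocabulary=['free', 'gentoo', 'linux'])

# Mode 2: construct without a vocabulary and derive it from data via `adapt`.
lookup = tf.keras.layers.CategoryLookup(vocabulary=None)
lookup.adapt(tf.data.Dataset.from_tensor_slices(['linux', 'gentoo', 'linux', 'free']))
```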
`tf.keras.layers.CategoryCrossing`
```python
CategoryCrossing(PreprocessingLayer):
  """This layer transforms multiple categorical inputs to categorical outputs
  by Cartesian product. and hash the output if necessary.
```
nit: remove extra .
Done.
```python
CategoryCrossing(PreprocessingLayer):
  """This layer transforms multiple categorical inputs to categorical outputs
  by Cartesian product. and hash the output if necessary.
  If any input is sparse, then output is sparse, otherwise dense."""
```
OOC, why is the wording here different than in the other API endpoints (it seems that the intended behavior is the same)?
Good question. This is the only layer that can accept multiple inputs; the other APIs only accept a single Tensor/SparseTensor. So with multiple inputs, if any one of them is sparse, the output will be sparse.
Got it. Maybe you should say, "If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense."
Yeah that's better. Done.
combined into all combinations of output with degree of `depth`. For example,
with inputs `a`, `b` and `c`, `depth=2` means the output will be [ab;ac;bc]
the example should be moved to the "Example" section below
Both: an example in each layer description, and a code snippet below.
If the layer receives two inputs, `a=[[1, 2]]` and `b=[[1, 3]]`,
and if depth is 2, then
the output will be a single integer tensor `[[i, j, k, l]]`, where:
i is the index of the category "a1=1 and b1=1"
j is the index of the category "a1=1 and b2=3"
k is the index of the category "a2=2 and b1=1"
l is the index of the category "a2=2 and b2=3"
I don't understand this example. What is a1 vs a2? What will the "single integer tensor `[[i, j, k, l]]`" actually look like for the given inputs?
Yeah it is confusing. Updated.
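For reference, a hedged sketch of the Cartesian-product crossing being described, emulated with existing TF string ops rather than the proposed layer (the `_X_` separator is just an illustration):

```python
import tensorflow as tf

a = tf.constant([[1, 2]])  # shape [batch, 2]
b = tf.constant([[1, 3]])  # shape [batch, 2]

# Build every (a, b) pair per example.
a_rep = tf.repeat(tf.strings.as_string(a), repeats=2, axis=1)  # [['1', '1', '2', '2']]
b_rep = tf.tile(tf.strings.as_string(b), [1, 2])               # [['1', '3', '1', '3']]
crossed = tf.strings.join([a_rep, b_rep], separator='_X_')
# crossed == [[b'1_X_1', b'1_X_3', b'2_X_1', b'2_X_3']]
# Each crossed string is one category; the proposed layer would map each to an index.
```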
pass

`tf.keras.layers.CategoryEncoding`
```python
CategoryEncoding(PreprocessingLayer):
```
Please add example for this layer.
Done.
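An example in the spirit of what was requested, emulated with existing ops since the proposed layer itself is not available here (a multi-hot/"binary" and a count encoding of integer category indices):

```python
import tensorflow as tf

indices = tf.constant([[0, 2], [1, 1]])               # category indices, shape [batch, n]
num_categories = 4
one_hot = tf.one_hot(indices, depth=num_categories)   # [batch, n, num_categories]
multi_hot = tf.reduce_max(one_hot, axis=1)            # [[1., 0., 1., 0.], [0., 1., 0., 0.]]
counts = tf.reduce_sum(one_hot, axis=1)               # [[1., 0., 1., 0.], [0., 2., 0., 0.]]
```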
pass

`tf.keras.layers.CategoryHashing`
```python
CategoryHashing(PreprocessingLayer):
```
Please add example for this layer.
Done.
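Similarly, a rough illustration of the hashing behaviour using an existing op (not the proposed layer; the bucket count is arbitrary):

```python
import tensorflow as tf

values = tf.constant([['kitchenware'], ['electronics'], ['sports']])
num_bins = 16
hashed = tf.strings.to_hash_bucket_fast(values, num_buckets=num_bins)
# `hashed` contains integers in [0, num_bins) that can feed an Embedding layer.
```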
We also propose a `to_sparse` op to convert dense tensors to sparse tensors given user-specified ignore values. This op can be used in both `tf.data` and [TF Transform](https://www.tensorflow.org/tfx/transform/get_started). In the previous feature column world, "" is ignored for dense string input and -1 is ignored for dense int input.
I would prefer if the `SparseTensor` class had a `from_dense` method.
If we don't need the functionality of `sparse_output = to_sparse(sparse_input)`, then `from_dense` is probably better.
This "imagined" functionality is not used anywhere though. In TFT, I think any tf.io.VarLenFeature should automatically be sparse input; we just need to call SparseTensor.from_dense for any tf.io.FixedLenFeature.
WDYT?
I realized we already have `from_dense`, so perhaps you should just extend it with an option to set the element to be ignored?
Yeah good point. I wasn't aware of this op. We should just extend it. Done.
`tf.to_sparse`
```python
def to_sparse(input, ignore_value):
  """Convert dense/sparse tensor to sparse while dropping user specified values."""
```
What is the benefit of calling this API with a SparseTensor input?
To allow users to filter specified values, e.g., if the original input is already sparse:
indices = [[0, 0], [1, 0], [1, 1]]
values = ['A', '', 'C']
the user can still filter '' from it.
This filtering can be built out of existing operations. You can call tf.where on `values` and pass the result to tf.sparse.retain, which is simple enough that I do not see the point of introducing syntactic sugar for that.
Makes sense. Let's just extend the `tf.sparse.from_dense` op.
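A small sketch of the approach discussed above, using only existing ops (here the boolean mask comes from `tf.not_equal` rather than `tf.where`, which is one straightforward way to express it):

```python
import tensorflow as tf

# Filter '' out of an already-sparse input with tf.sparse.retain.
sp = tf.SparseTensor(indices=[[0, 0], [1, 0], [1, 1]],
                     values=['A', '', 'C'],
                     dense_shape=[2, 2])
filtered = tf.sparse.retain(sp, tf.not_equal(sp.values, ''))

# For dense input, tf.sparse.from_dense already drops the zero element
# ('' for strings, 0 for numbers).
dense = tf.constant([['A', ''], ['', 'C']])
sparse = tf.sparse.from_dense(dense)
```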
Yeah I will update it soon.
@tanzhenyu The layers look great!
From the AutoKeras perspective, I would like a workflow that lets the user feed the CSV data they load into a single input node instead of many:
hide the single-column tensors inside the Keras Model, and use one "decompose" layer after the input node to tear the CSV data into single-column tensors and feed the categorical layers.
Is this possible?
```python
vocab_list = sorted(dftrain[feature_name].unique())
# Map string values to indices
x = tf.keras.layers.Lookup(vocabulary=vocab_list, name=feature_name)(feature_input)
x = tf.keras.layers.Vectorize(num_categories=len(vocab_list))(x)
```
incorrect indent.
```python
model.compile('sgd', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])

dataset = tf.data.Dataset.from_tensor_slices((
```
Is it possible to have a single input node and use some layer to decompose the input to these single column nodes?
Use tf.split for that?
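For instance, a minimal sketch of that decomposition (the column count and dtype are made up for illustration):

```python
import tensorflow as tf

num_columns = 5
combined = tf.keras.Input(shape=(num_columns,), dtype=tf.string)  # one input node
# Tear the combined input into per-column tensors for the categorical layers.
columns = tf.split(combined, num_or_size_splits=num_columns, axis=1)
# `columns` is a list of [batch, 1] tensors, one per CSV column.
```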
The CLA looks good. tanzhenyu signed the CLA but some of the commits were from his laptop account.
Yeah I created #209 as a workaround -- which one should we merge here?
Let's continue in this one since it has the history and get this one merged. Also, do you have notes from the review meeting you could post here?
It doesn't seem I can update this once the original branch is gone. What can we do to fix this?
In the RFC, the preprocessing layers like
That's a very good question. We're currently gathering use cases for supporting sparse with the Embedding layer,
Yes. Except in TF 2.1.0, I find that the
Sorry for the delay.
Thanks for your explanation.
Of course, contribution is welcome! Can you make a PR for it?
Comment period is open till Dec 31, 2019.
Keras categorical inputs

Objective

This document proposes 4 new preprocessing Keras layers (`CategoryLookup`, `CategoryCrossing`, `CategoryEncoding`, `CategoryHashing`), and 1 additional op (`to_sparse`), to allow users to replace `tf.keras.layers.DenseFeatures` with the proposed layers.