Convert string values to integer IDs in data transformation. #1721
I vote for the third solution, customizing a Keras layer to convert category values to IDs, for the following reasons:
Why do we need to convert category values to IDs?
The data type of features in the table may be int, float, or string. For example:
The data type of "age" is int, so we can directly create a tensor [[34], [54], [42]] and feed it to a dense layer in deep learning. However, "education" and "marital-status" are strings, and a string tensor cannot be fed to a dense layer. So we generally use an embedding to represent each category string value as a vector.
Although the ElasticDL embedding layer supports looking up embedding vectors for string values, it cannot be exported to a SavedModel. In the model serving design, we support exporting a model trained in ElasticDL as a TensorFlow SavedModel for TF Serving if the embedding size is not huge. To do so, we need to replace the ElasticDL embedding with the TensorFlow embedding in the trained model. However, the inputs of the TensorFlow embedding must be integer IDs, so we need to convert category feature values to IDs if we want to export a TensorFlow SavedModel. For example:
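A minimal sketch of what "converting category values to IDs" means, in plain Python. It shows the two usual approaches: a vocabulary lookup and hashing into a fixed number of buckets. The column values, the helper names `vocabulary_ids` and `hash_bucket_ids`, and the bucket count are hypothetical, and the MD5-based hash is only a stand-in for whichever hash function TensorFlow applies internally.

```python
import hashlib


def vocabulary_ids(values, vocabulary):
    """Map each string to its index in `vocabulary`.

    Values missing from the vocabulary share one extra ID, len(vocabulary).
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(v, len(vocabulary)) for v in values]


def hash_bucket_ids(values, num_buckets):
    """Map each string to a stable bucket in [0, num_buckets); no vocabulary needed."""
    return [
        int(hashlib.md5(v.encode("utf-8")).hexdigest(), 16) % num_buckets
        for v in values
    ]


# Hypothetical values of the "education" column.
education = ["Bachelors", "Masters", "Bachelors", "Doctorate"]
print(vocabulary_ids(education, ["Bachelors", "Masters"]))  # [0, 1, 0, 2]
print(hash_bucket_ids(education, num_buckets=100))
```

A vocabulary gives compact, collision-free IDs but has to be built from the data first. Hashing needs no preparation, at the cost of possible collisions when the number of buckets is small.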
Solution proposals to convert categorical values to IDs using the TensorFlow API
There are 3 solutions to convert categorical values to IDs using TensorFlow:
1. Use categorical columns in `tf.feature_column`, such as `tf.feature_column.categorical_column_with_hash_bucket`, to convert category values to IDs. However, the output of `tf.feature_column.categorical_column_with_hash_bucket` is a sparse tensor, which cannot be used directly in `tf.keras.layers.DenseFeatures`. So we must use `tf.feature_column.embedding_column` or `tf.feature_column.indicator_column` to convert the `tf.SparseTensor` to a dense tensor. An example of a Keras model using `categorical_column_with_hash_bucket`:
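The original example is not preserved here, so the following is only a minimal sketch of solution 1. It assumes a TensorFlow 2.x release that still ships `tf.keras.layers.DenseFeatures` (Keras 2); the feature values, the bucket count of 100, and the embedding dimension of 8 are made up for illustration.

```python
import tensorflow as tf

# Hypothetical mini-batch; the values are invented for illustration.
features = {
    "age": tf.constant([[34.0], [54.0], [42.0]]),
    "education": tf.constant([["Bachelors"], ["Masters"], ["Bachelors"]]),
}

age = tf.feature_column.numeric_column("age")
# Hashes each string into one of 100 buckets; the column yields sparse IDs.
education_ids = tf.feature_column.categorical_column_with_hash_bucket(
    "education", hash_bucket_size=100
)
# Wrap the categorical column so that DenseFeatures receives a dense tensor.
education_embedding = tf.feature_column.embedding_column(education_ids, dimension=8)

dense_features = tf.keras.layers.DenseFeatures([age, education_embedding])
x = dense_features(features)  # per row: 1 age value + 8 embedding values
logits = tf.keras.layers.Dense(1)(x)
print(x.shape, logits.shape)
```

Swapping `embedding_column` for `tf.feature_column.indicator_column(education_ids)` would instead produce a multi-hot vector whose width equals the number of hash buckets.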
2. Customize a transformation function in `tf.feature_column.numeric_column` to convert category values to IDs. The output of `tf.feature_column.numeric_column` is a dense tensor with integer IDs, which can be used directly in `tf.keras.layers.DenseFeatures`. If we want to make embeddings from those IDs, we can use `tf.keras.layers.Embedding` after `DenseFeatures`. Of course, we can also convert the values to IDs with custom Keras layers. The model example:
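The original example is not preserved here, so the following is only a minimal sketch of solution 2. It assumes a TensorFlow 2.x release that still ships `tf.keras.layers.DenseFeatures` (Keras 2) and passes the transformation through the `normalizer_fn` argument of `numeric_column`. The helper name `to_hash_id`, the bucket count, and the feature values are hypothetical.

```python
import tensorflow as tf

HASH_BUCKET_SIZE = 100  # hypothetical number of buckets


def to_hash_id(x):
    # Hash each string into [0, HASH_BUCKET_SIZE); returns an int64 tensor.
    return tf.strings.to_hash_bucket_fast(x, HASH_BUCKET_SIZE)


# The transformation runs inside the numeric column via normalizer_fn.
education_id = tf.feature_column.numeric_column(
    "education", normalizer_fn=to_hash_id
)

# Hypothetical mini-batch of string values.
features = {"education": tf.constant([["Bachelors"], ["Masters"], ["Bachelors"]])}
ids = tf.keras.layers.DenseFeatures([education_id])(features)  # shape (3, 1)

# numeric_column casts its output to float32, so the IDs come back as floats;
# cast them to integers before the embedding lookup.
embedded = tf.keras.layers.Embedding(HASH_BUCKET_SIZE, 8)(tf.cast(ids, tf.int64))
print(ids.numpy().ravel(), embedded.shape)
```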
3. Customize a Keras layer to convert category values to IDs.
The output is also a dense tensor, and we don't need to use `tf.keras.layers.DenseFeatures`. In addition, Keras is developing preprocessing layers for categorical features, and the APIs will be released in TF 2.2. Then we can use the Keras preprocessing layers to replace the custom layers. The model example:
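The original example is not preserved here, so the following is only a minimal sketch of solution 3. The layer name `HashToId`, the bucket count, and the feature values are hypothetical; hashing is used because it needs no vocabulary, but a lookup-table layer would follow the same pattern.

```python
import tensorflow as tf


class HashToId(tf.keras.layers.Layer):
    """Hypothetical custom layer that hashes category values into integer IDs."""

    def __init__(self, num_buckets, **kwargs):
        super().__init__(**kwargs)
        self.num_buckets = num_buckets

    def call(self, inputs):
        # Numeric categories are turned into strings first so they can be hashed.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        # Returns an int64 tensor with values in [0, num_buckets).
        return tf.strings.to_hash_bucket_fast(inputs, self.num_buckets)

    def get_config(self):
        config = super().get_config()
        config.update({"num_buckets": self.num_buckets})
        return config


# Hypothetical mini-batch of string values.
education = tf.constant([["Bachelors"], ["Masters"], ["Bachelors"]])
ids = HashToId(num_buckets=100)(education)         # int64, shape (3, 1)
embedded = tf.keras.layers.Embedding(100, 8)(ids)  # shape (3, 1, 8)
print(ids.numpy().ravel(), embedded.shape)
```

Because the layer has no weights and uses only TensorFlow string ops, the ID conversion stays inside the model graph rather than in a separate preprocessing step.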