Description
Why do we need to convert category values to IDs?
The data type of features in a table may be int, float, or string. For example:
age | education | marital-status |
---|---|---|
34 | Master | Divorced |
54 | Doctor | Never-married |
42 | Bachelor | Never-married |
The data type of "age" is int, so we can directly create a tensor [[34], [54], [42]] and feed it to a dense layer in deep learning. However, "education" and "marital-status" are strings, and a string tensor cannot be fed to a dense layer. So we generally represent each categorical string value as a vector using an embedding.
Although the ElasticDL embedding layer supports looking up embedding vectors for string values, it cannot be exported to a SavedModel. In the model serving design, we support exporting a model trained in ElasticDL as a TensorFlow SavedModel for TF Serving if the embedding size is not huge. To do so, we need to replace the ElasticDL embedding with a TensorFlow embedding in the trained model. However, the inputs of a TensorFlow embedding must be integer IDs, so we need to convert categorical feature values to IDs if we want to export a TensorFlow SavedModel. For example:
age | education | marital-status |
---|---|---|
34 | 0 | 0 |
54 | 1 | 1 |
42 | 2 | 1 |
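The mapping in the table above can be sketched in plain Python, assuming IDs are assigned in order of first appearance. The helper name build_vocabulary is hypothetical and is not part of ElasticDL or TensorFlow; it only illustrates the idea of a vocabulary lookup.

```python
def build_vocabulary(values):
    """Map each distinct value to an integer ID in order of first appearance."""
    vocabulary = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary)
    return vocabulary


education = ["Master", "Doctor", "Bachelor"]
marital_status = ["Divorced", "Never-married", "Never-married"]

education_vocab = build_vocabulary(education)
marital_vocab = build_vocabulary(marital_status)

# Replace every string with its ID from the vocabulary.
education_ids = [education_vocab[v] for v in education]
marital_ids = [marital_vocab[v] for v in marital_status]

print(education_ids)  # [0, 1, 2]
print(marital_ids)    # [0, 1, 1]
```

A vocabulary gives collision-free IDs but must be built from the data beforehand, which is one reason the proposals below rely on hashing instead.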
Solution proposals to convert categorical values to IDs using the TensorFlow API
There are 3 solutions to convert categorical values to IDs using TensorFlow:
- Use categorical columns in tf.feature_column.
- Customize a transformation function (normalizer_fn) in tf.feature_column.numeric_column.
- Customize Keras layers that convert values using TensorFlow ops.
1. Use categorical columns in tf.feature_column, such as tf.feature_column.categorical_column_with_hash_bucket, to convert category values to IDs.
hash_bucket_column = tf.feature_column.categorical_column_with_hash_bucket(
    name, hash_bucket_size
)
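To make the idea of hashing into buckets concrete, here is a minimal pure-Python sketch. It uses hashlib.md5 as a stand-in hash function, which is an assumption for illustration only; TensorFlow uses its own fingerprint function, so the IDs it produces will differ from the ones computed here. The function name hash_bucket_id is hypothetical.

```python
import hashlib


def hash_bucket_id(value, hash_bucket_size):
    """Return a deterministic bucket ID in [0, hash_bucket_size)."""
    # Non-string values are converted to strings before hashing.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


for category in ["Master", "Doctor", "Bachelor"]:
    print(category, hash_bucket_id(category, hash_bucket_size=100))
```

Hashing needs no vocabulary, but two distinct values can land in the same bucket. A larger hash_bucket_size makes such collisions less likely at the cost of a larger embedding table.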
However, the output of tf.feature_column.categorical_column_with_hash_bucket is a sparse tensor, which cannot be used directly in tf.keras.layers.DenseFeatures. So we must use tf.feature_column.embedding_column or tf.feature_column.indicator_column to convert the tf.SparseTensor to a dense tensor.
embedded_column = tf.feature_column.embedding_column(hash_bucket_column, embedding_dim)
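As a rough picture of the dense conversion, the sketch below shows the one-hot vector that an indicator-style encoding yields for a single bucket ID. This is a plain-Python illustration under the assumption of one value per example, not TensorFlow's implementation, and the function name one_hot is hypothetical.

```python
def one_hot(bucket_id, num_buckets):
    """Return a dense vector with 1.0 at position bucket_id and 0.0 elsewhere."""
    if not 0 <= bucket_id < num_buckets:
        raise ValueError("bucket_id must be in [0, num_buckets)")
    vector = [0.0] * num_buckets
    vector[bucket_id] = 1.0
    return vector


print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

The one-hot vector has as many entries as there are buckets, whereas an embedding maps the same ID to a much shorter learned vector, which is why an embedding is usually preferred for large bucket sizes.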
An example Keras model using categorical_column_with_hash_bucket:
import tensorflow as tf

# DenseFeatures expects a dict that maps each feature name to its input tensor.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="education", hash_bucket_size=100
)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="marital-status", hash_bucket_size=50
)
# embedding_column names the embedding size "dimension".
education_embedded_column = tf.feature_column.embedding_column(
    education_hash_column, dimension=10
)
marital_embedded_column = tf.feature_column.embedding_column(
    marital_hash_column, dimension=4
)
education_embedded = tf.keras.layers.DenseFeatures(
    [education_embedded_column]
)(input_layers)
marital_embedded = tf.keras.layers.DenseFeatures(
    [marital_embedded_column]
)(input_layers)
2. Customize a transformation function in tf.feature_column.numeric_column to convert category values to IDs, like:
import tensorflow as tf


def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Hashing works on strings, so convert other dtypes first.
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    # normalizer_fn is applied to the raw feature tensor.
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )
The output of tf.feature_column.numeric_column is a dense tensor with integer IDs, which can be used directly in tf.keras.layers.DenseFeatures. If we want to embed those IDs, we can apply tf.keras.layers.Embedding after DenseFeatures. Of course, we can also convert values to IDs with custom Keras layers.
The model example:
# DenseFeatures expects a dict that maps each feature name to its input tensor.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=100
)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=50
)
education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)
education_embedded = tf.keras.layers.Embedding(100, 8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50, 4)(marital_hash_ids)
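Conceptually, an embedding layer is a table with one row per ID, and the lookup simply selects rows. The following pure-Python sketch illustrates this under simplifying assumptions: the table is filled with random values and is not trained, and the names make_embedding_table and embedding_lookup are hypothetical.

```python
import random


def make_embedding_table(num_ids, embedding_dim, seed=0):
    """Create a num_ids x embedding_dim table of small random values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
        for _ in range(num_ids)
    ]


def embedding_lookup(table, ids):
    """Return the table row for every ID."""
    return [table[i] for i in ids]


# 50 rows match hash_bucket_size=50, so every bucket ID has a row.
table = make_embedding_table(num_ids=50, embedding_dim=4)
vectors = embedding_lookup(table, [3, 17, 17])
print(len(vectors), len(vectors[0]))  # 3 4
```

This is why the first argument of Embedding in the model example equals the hash_bucket_size of the corresponding column: a smaller table would leave some bucket IDs without a row.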
3. Customize a Keras layer to convert category values to IDs.
import tensorflow as tf


class HashBucket(tf.keras.layers.Layer):
    """Hash input values into integer bucket IDs."""

    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Hashing works on strings, so convert other dtypes first.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)
The output is also a dense tensor, so we don't need tf.keras.layers.DenseFeatures. In addition, Keras is developing preprocessing layers for categorical features, and the APIs will be released in TF 2.2. Then we can use the Keras preprocessing layers to replace the custom layers.
The model example:
education_input = tf.keras.layers.Input(
    name="education", shape=(1,), dtype=tf.string
)
marital_input = tf.keras.layers.Input(
    name="marital-status", shape=(1,), dtype=tf.string
)
education_hash_ids = HashBucket(hash_bucket_size=100)(education_input)
marital_hash_ids = HashBucket(hash_bucket_size=50)(marital_input)
education_embedded = tf.keras.layers.Embedding(100, 8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50, 4)(marital_hash_ids)