Convert string values to integer IDs in data transformation. #1721

Closed
@workingloong

Why do we need to convert category values to IDs?

The data type of features in a table may be int, float, or string. For example:

age  education  marital-status
34   Master     Divorced
54   Doctor     Never-married
42   Bachelor   Never-married

The data type of "age" is int, so we can directly create a tensor [[34], [54], [42]] and feed it to a dense layer in deep learning. However, "education" and "marital-status" are strings, and a string tensor cannot be fed to a dense layer. So we generally use an embedding to represent each categorical string value as a vector.

Although the ElasticDL embedding layer supports looking up embedding vectors for string values, it cannot be exported to SavedModel. In the model serving design, we support exporting a model trained in ElasticDL as a TensorFlow SavedModel for TF Serving if the embedding size is not huge. To do that, we need to replace the ElasticDL embedding with a TensorFlow embedding in the trained model. However, the inputs of a TensorFlow embedding must be integer IDs, so we need to convert categorical feature values to IDs if we want to export a TensorFlow SavedModel. For example:

age  education  marital-status
34   0          0
54   1          1
42   2          1
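The mapping in the table above can be reproduced with a plain vocabulary lookup. This is only a minimal pure-Python sketch of the idea; the helper name `build_vocabulary` is hypothetical, and assigning IDs in order of first appearance is an assumption that happens to match the table.

```python
def build_vocabulary(values):
    """Assign each distinct value an ID in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


education = ["Master", "Doctor", "Bachelor"]
marital_status = ["Divorced", "Never-married", "Never-married"]

education_vocab = build_vocabulary(education)
marital_vocab = build_vocabulary(marital_status)

# Replace each string with its integer ID.
education_ids = [education_vocab[v] for v in education]
marital_ids = [marital_vocab[v] for v in marital_status]

print(education_ids)  # [0, 1, 2]
print(marital_ids)    # [0, 1, 1]
```

A vocabulary like this has to be built from the data before training. The hash-based proposals in this issue avoid that step at the cost of possible collisions.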

Solution proposals to convert categorical values to IDs using the TensorFlow API

There are 3 solutions to convert categorical values to IDs using TensorFlow:

  • Use categorical columns in tf.feature_column.
  • Customize normalizer_fn in tf.feature_column.numeric_column.
  • Customize Keras layers that convert values using TensorFlow ops.

1. Use categorical columns in tf.feature_column, such as tf.feature_column.categorical_column_with_hash_bucket, to convert category values to IDs.

hash_bucket_column = tf.feature_column.categorical_column_with_hash_bucket(
    name, hash_bucket_size
)
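Conceptually, a hash bucket column hashes each value and takes the result modulo the bucket size, so no vocabulary is needed. The sketch below illustrates that idea in pure Python. The helper `to_hash_bucket` is hypothetical, and it uses MD5 from the standard library only for illustration; TensorFlow uses a different hash function, so the IDs will not match what TensorFlow produces.

```python
import hashlib


def to_hash_bucket(value, num_buckets):
    """Map any value to a stable ID in [0, num_buckets)."""
    # Non-string values are converted to strings first, as the
    # TensorFlow snippets in this issue do with tf.strings.as_string.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


education = ["Master", "Doctor", "Bachelor"]
education_ids = [to_hash_bucket(v, 100) for v in education]
print(education_ids)
```

Two different values can land in the same bucket, so the bucket size is usually chosen to be comfortably larger than the number of distinct values.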

The output of tf.feature_column.categorical_column_with_hash_bucket, however, is a sparse tensor, which cannot be used directly in tf.keras.layers.DenseFeatures. So we must use tf.feature_column.embedding_column or tf.feature_column.indicator_column to convert the tf.SparseTensor to a dense tensor.

embedded_column = tf.feature_column.embedding_column(hash_bucket_column, embedding_dim)
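To make the two dense conversions concrete, here is a rough pure-Python sketch of what an indicator column and an embedding column produce for a single-valued feature. The helpers `one_hot` and `make_embedding_table` are hypothetical, and the random table only stands in for the trainable embedding variable that TensorFlow would learn.

```python
import random


def one_hot(category_id, num_buckets):
    """Indicator-style encoding: a vector with a single 1.0."""
    vec = [0.0] * num_buckets
    vec[category_id] = 1.0
    return vec


def make_embedding_table(num_buckets, dim, seed=0):
    """One row of `dim` floats per bucket (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for _ in range(num_buckets)
    ]


num_buckets = 5
ids = [0, 1, 1]  # the marital-status IDs from the table above

# Indicator: each ID becomes a vector of length num_buckets.
indicator = [one_hot(i, num_buckets) for i in ids]

# Embedding: each ID selects a row of the table.
table = make_embedding_table(num_buckets, dim=4)
embedded = [table[i] for i in ids]

print(indicator[0])      # [1.0, 0.0, 0.0, 0.0, 0.0]
print(len(embedded[0]))  # 4
```

The indicator output grows with the bucket size, while the embedding output has a fixed dimension, which is why embeddings are usually preferred for large bucket sizes.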

An example of a Keras model using categorical_column_with_hash_bucket:

# DenseFeatures expects a dict that maps feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "education", hash_bucket_size=100
)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "marital-status", hash_bucket_size=50
)

# The embedding size argument of embedding_column is named `dimension`.
education_embedded_column = tf.feature_column.embedding_column(
    education_hash_column, dimension=10
)
marital_embedded_column = tf.feature_column.embedding_column(
    marital_hash_column, dimension=4
)
education_embedded = tf.keras.layers.DenseFeatures([education_embedded_column])(input_layers)
marital_embedded = tf.keras.layers.DenseFeatures([marital_embedded_column])(input_layers)

2. Customize a transformation function, passed as normalizer_fn to tf.feature_column.numeric_column, to convert category values to IDs, like:

def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Convert non-string inputs (e.g. integers) to strings before hashing.
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )

The output of tf.feature_column.numeric_column is a dense tensor of IDs, which can be used directly in tf.keras.layers.DenseFeatures. Note that DenseFeatures casts numeric columns to float32, so the IDs come out as float values; tf.keras.layers.Embedding casts non-integer inputs back to integers. If we want to make embeddings on those IDs, we can use tf.keras.layers.Embedding after DenseFeatures. Of course, we can also convert values to IDs with custom Keras layers.

The model example:

# DenseFeatures expects a dict that maps feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=100
)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=50
)

education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)

# The Embedding input_dim must equal the hash bucket size.
education_embedded = tf.keras.layers.Embedding(100, 8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50, 4)(marital_hash_ids)

3. Customize a Keras layer to convert category values to IDs.

class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Convert non-string inputs (e.g. integers) to strings before hashing.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)

The output is also a dense tensor, so we don't need tf.keras.layers.DenseFeatures. In addition, Keras is developing preprocessing layers for categorical features, and the APIs will be released in TF 2.2. Then we can use the Keras preprocessing layers to replace the custom layer.

The model example:

education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)

education_hash_ids = HashBucket(hash_bucket_size=100)(education_input)
marital_hash_ids = HashBucket(hash_bucket_size=50)(marital_input)

education_embedded = tf.keras.layers.Embedding(100,8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50,4)(marital_hash_ids)
