Convert string values to integer IDs in data transformation. #1721

Closed
workingloong opened this issue Feb 10, 2020 · 1 comment
Collaborator

workingloong commented Feb 10, 2020

Why do we need to convert category values to IDs?

The data type of features in the table may be int, float or string. For example:

age  education  marital-status
34   Master     Divorced
54   Doctor     Never-married
42   Bachelor   Never-married

The data type of "age" is int, so we can directly create a tensor [[34], [54], [42]] and feed it to a dense layer in deep learning. However, "education" and "marital-status" are strings, and a string tensor cannot be fed to a dense layer. So we generally represent each categorical string value with an embedding vector.

Although the ElasticDL embedding layer supports looking up embedding vectors for string values, it cannot be exported to SavedModel. In the model serving design, we support exporting a model trained in ElasticDL as a TensorFlow SavedModel for TF Serving if the embedding size is not huge. To do so, we need to replace the ElasticDL embedding with the TensorFlow embedding in the trained model. However, the inputs of the TensorFlow embedding must be integer IDs. So we need to convert categorical feature values to IDs if we want to export a TensorFlow SavedModel. For example:

age  education  marital-status
34   0          0
54   1          1
42   2          1
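One way to obtain IDs like those in the table is to number each distinct value in order of first appearance. The following is a minimal plain-Python sketch of that idea, not ElasticDL or TensorFlow code; `build_vocabulary` and `to_ids` are hypothetical helper names used only for illustration.

```python
def build_vocabulary(values):
    """Assign each distinct value an ID in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def to_ids(values, vocab):
    """Replace every value with its integer ID from the vocabulary."""
    return [vocab[value] for value in values]


education = ["Master", "Doctor", "Bachelor"]
marital_status = ["Divorced", "Never-married", "Never-married"]

education_ids = to_ids(education, build_vocabulary(education))
marital_ids = to_ids(marital_status, build_vocabulary(marital_status))
```

A vocabulary like this has to be built from the data before training. Hashing, which the proposals in this issue use, avoids that step at the cost of possible collisions.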

Solution proposals to convert categorical values to IDs using the TensorFlow API.

There are 3 solutions to convert categorical values to IDs using TensorFlow:

  • Use categorical columns in tf.feature_column.
  • Customize a transformation function (normalizer_fn) in tf.feature_column.numeric_column.
  • Customize Keras layers to convert using TensorFlow ops.

1. Use categorical columns in tf.feature_column, such as tf.feature_column.categorical_column_with_hash_bucket, to convert category values to IDs.

hash_bucket_column = tf.feature_column.categorical_column_with_hash_bucket(
    name, hash_bucket_size
)
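As a rough illustration of what hashing into buckets means, the plain-Python sketch below maps a string to an ID in the range [0, hash_bucket_size). It is an assumption-laden stand-in: MD5 is used only because it is in the standard library, TensorFlow uses a different hash function, and the concrete IDs will not match what TensorFlow produces. The name `hash_bucket_id` is hypothetical.

```python
import hashlib


def hash_bucket_id(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size).

    Assumption: MD5 stands in for the hash function here; TensorFlow
    uses a different one, so the IDs differ from TensorFlow's output.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket it.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


ids = [hash_bucket_id(v, 100) for v in ["Master", "Doctor", "Bachelor", "Master"]]
```

The same string always lands in the same bucket, while two different strings may collide. A larger bucket size makes collisions less likely.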

However, the output of tf.feature_column.categorical_column_with_hash_bucket is a sparse tensor, which cannot be used directly in tf.keras.layers.DenseFeatures. So we must use tf.feature_column.embedding_column or tf.feature_column.indicator_column to convert the tf.SparseTensor to a dense tensor.

embedded_column = tf.feature_column.embedding_column(hash_bucket_column, embedding_dim)
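For intuition about the indicator alternative, an indicator column turns the IDs of one example into a dense vector with one slot per bucket. The plain-Python sketch below is a simplified illustration, not the TensorFlow implementation; `multi_hot` is a hypothetical name, and it only marks presence rather than counting repeated IDs.

```python
def multi_hot(ids, num_buckets):
    """Return a dense vector with 1.0 at every position listed in ids."""
    vector = [0.0] * num_buckets
    for i in ids:
        vector[i] = 1.0
    return vector


# One example whose categorical values hashed to buckets 1 and 3.
dense = multi_hot([1, 3], num_buckets=5)
```

The dense vector grows with the number of buckets, which is why an embedding is usually preferred when the bucket count is large.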

An example Keras model using categorical_column_with_hash_bucket:

import tensorflow as tf

# DenseFeatures expects a dict that maps feature names to tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "education", hash_bucket_size=100
)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "marital-status", hash_bucket_size=50
)

# The embedding size argument of embedding_column is named `dimension`.
education_embedded_column = tf.feature_column.embedding_column(
    education_hash_column, dimension=10
)
marital_embedded_column = tf.feature_column.embedding_column(
    marital_hash_column, dimension=4
)
education_embedded = tf.keras.layers.DenseFeatures([education_embedded_column])(input_layers)
marital_embedded = tf.keras.layers.DenseFeatures([marital_embedded_column])(input_layers)

2. Customize a transformation function in tf.feature_column.numeric_column to convert category values to IDs, like:

def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Hashing needs string input, so convert other dtypes first.
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    # normalizer_fn is applied to the raw feature tensor.
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )

The output of tf.feature_column.numeric_column is a dense tensor of IDs, which can be used directly in tf.keras.layers.DenseFeatures. If we want to embed those IDs, we can apply tf.keras.layers.Embedding after DenseFeatures. Of course, we can also convert values to IDs with custom Keras layers.

The model example:

# DenseFeatures expects a dict that maps feature names to tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=100
)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=50
)

education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)

education_embedded = tf.keras.layers.Embedding(100, 8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50, 4)(marital_hash_ids)

3. Customize a Keras layer to convert category values to IDs.

class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Hashing needs string input, so convert other dtypes first.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)

The output is also a dense tensor, so we don't need tf.keras.layers.DenseFeatures. Keras is also developing preprocessing layers for categorical features, and the APIs will be released in TF 2.2. Then we can use the Keras preprocessing layers to replace the custom layer.

The model example:

education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)

education_hash_ids = HashBucket(hash_bucket_size=100)(education_input)
marital_hash_ids = HashBucket(hash_bucket_size=50)(marital_input)

education_embedded = tf.keras.layers.Embedding(100,8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50,4)(marital_hash_ids)
@workingloong workingloong changed the title How to convert category values to integer IDs in data transformation for SQLFlow. How to convert category values to integer IDs in data transformation. Feb 11, 2020
@workingloong workingloong changed the title How to convert category values to integer IDs in data transformation. Convert category values to integer IDs in data transformation. Feb 12, 2020
@workingloong
Collaborator Author

I vote for the 3rd solution, customizing a Keras layer to convert category values to IDs, for these reasons:

  1. With the first 2 solutions, users must learn both the feature column API and the Keras API. With the 3rd solution, users only need to learn the Keras API.
  2. Keras is developing preprocessing layers, proposed in an RFC, to replace feature columns and tf.keras.layers.DenseFeatures.
  3. With the 1st solution, we must set a combiner for embedding_column. The combiner merges the embedding vectors of the values into a single vector using mean, sum or sqrtn. Sometimes we don't want to combine the vectors, as with the embedding in DeepFM:
    def custom_model(
        input_dim=5383, embedding_dim=64, input_length=10, fc_unit=64
    ):
        inputs = tf.keras.Input(shape=(input_length,))
        embed_layer = Embedding(
            input_dim=input_dim,
            output_dim=embedding_dim,
            mask_zero=True,
            input_length=input_length,
        )
        embeddings = embed_layer(inputs)
        embeddings = ApplyMask()(embeddings)
        emb_sum = K.sum(embeddings, axis=1)
        emb_sum_square = K.square(emb_sum)
        emb_square = K.square(embeddings)
        emb_square_sum = K.sum(emb_square, axis=1)
        second_order = K.sum(
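
To make the combiner point in reason 3 concrete, here is a plain-Python sketch of what combining does to the embedding vectors of one example. It is an unweighted illustration under the assumption that every value has weight 1, in which case "sqrtn" divides the sum by the square root of the number of vectors. The function name `combine` is hypothetical and this is not the TensorFlow implementation.

```python
import math


def combine(vectors, combiner="mean"):
    """Reduce a list of equal-length vectors to a single vector."""
    n = len(vectors)
    # Element-wise sum across all vectors.
    summed = [sum(column) for column in zip(*vectors)]
    if combiner == "sum":
        return summed
    if combiner == "mean":
        return [s / n for s in summed]
    if combiner == "sqrtn":
        return [s / math.sqrt(n) for s in summed]
    raise ValueError("unknown combiner: " + combiner)


vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
summed = combine(vectors, "sum")
mean = combine(vectors, "mean")
sqrtn = combine(vectors, "sqrtn")
```

In every case the four input vectors collapse into one, so a model that needs the individual vectors cannot recover them after combining.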

@workingloong workingloong changed the title Convert category values to integer IDs in data transformation. Convert string values to integer IDs in data transformation. Feb 13, 2020