Convert string values to integer IDs in data transformation.

### Why do we need to convert category values to IDs?
The data type of a feature in the table may be int, float, or string. For example:

|  age | education | marital-status |
| ---- | --- | --- |
|  34  | Master | Divorced |
|  54  | Doctor | Never-married |
|  42  | Bachelor | Never-married |

The data type of "age" is int, so we can directly create a tensor `[[34], [54], [42]]` and feed it to a dense layer in a deep learning model. However, "education" and "marital-status" are strings, and a string tensor cannot be fed to a dense layer. So we generally represent each categorical string value with a vector obtained from an embedding.

Although the ElasticDL embedding layer supports looking up embedding vectors for string values, it cannot be exported to SavedModel. In the [model serving design](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/model_serving.md), we support exporting a model trained in ElasticDL as a TensorFlow SavedModel for TF Serving if the embedding size is not huge. To do so, we need to replace the ElasticDL embedding with a TensorFlow embedding in the trained model. However, the inputs of a TensorFlow embedding must be integer IDs. So we need to convert categorical feature values to IDs if we want to export a TensorFlow SavedModel. For example:

|  age | education | marital-status |
| ---- | --- | --- |
|  34  | 0 | 0 |
|  54  | 1 | 1 |
|  42  | 2 | 1 |
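As a minimal sketch of the conversion shown in the two tables, the plain-Python snippet below assigns each distinct string an ID in order of first appearance. The helper name `build_vocabulary` is hypothetical and only illustrates the idea; the proposals in this issue use hash bucketing in TensorFlow instead of an explicit vocabulary.

```python
def build_vocabulary(values):
    """Assign each distinct value an integer ID in order of first appearance."""
    vocabulary = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary)
    return vocabulary


education = ["Master", "Doctor", "Bachelor"]
marital_status = ["Divorced", "Never-married", "Never-married"]

education_vocab = build_vocabulary(education)
marital_vocab = build_vocabulary(marital_status)

# Look up the ID of every value in each column.
education_ids = [education_vocab[v] for v in education]
marital_ids = [marital_vocab[v] for v in marital_status]

print(education_ids)  # [0, 1, 2]
print(marital_ids)    # [0, 1, 1]
```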

### Solution proposals to convert categorical values to IDs using the TensorFlow API
There are three ways to convert categorical values to IDs using TensorFlow:
* Use categorical columns in `tf.feature_column`.
* Customize the `normalizer_fn` of `tf.feature_column.numeric_column`.
* Customize a Keras layer that converts values using TensorFlow ops.
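The examples in the proposals below all convert values with hash bucketing. As a rough sketch of what that means, the plain-Python snippet below hashes each value and takes the result modulo the bucket count. The function `to_hash_bucket` is hypothetical, and it uses MD5 purely for illustration; TensorFlow uses its own fingerprint function, so the actual IDs will differ.

```python
import hashlib


def to_hash_bucket(value, num_buckets):
    """Map a value to a stable bucket ID in [0, num_buckets)."""
    # Stringify first so that numeric values can be hashed as well.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


educations = ["Master", "Doctor", "Bachelor", "Master"]
education_ids = [to_hash_bucket(v, 100) for v in educations]

# Equal strings always land in the same bucket.
print(education_ids[0] == education_ids[3])  # True
```

Unlike an explicit vocabulary, hashing needs no pass over the data to collect the distinct values, but two different values may collide in the same bucket, so the bucket size should be chosen with the number of distinct values in mind.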

#### 1.  Use categorical columns in `tf.feature_column`, such as `tf.feature_column.categorical_column_with_hash_bucket`, to convert category values to IDs. 

```python
hash_bucket_column = tf.feature_column.categorical_column_with_hash_bucket(
    name, hash_bucket_size
)
```

However, the output of `tf.feature_column.categorical_column_with_hash_bucket` is a [sparse tensor](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor), which cannot be used directly in `tf.keras.layers.DenseFeatures`. So we must wrap it with `tf.feature_column.embedding_column` or `tf.feature_column.indicator_column` to convert the `tf.SparseTensor` to a dense tensor.

```python
embedded_column = tf.feature_column.embedding_column(hash_bucket_column, embedding_dim)
```
An example Keras model using `categorical_column_with_hash_bucket`:

```python
import tensorflow as tf

# DenseFeatures expects a dict that maps feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="education", hash_bucket_size=100
)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="marital-status", hash_bucket_size=50
)

# Wrap the sparse categorical columns so that they produce dense embeddings.
education_embedded_column = tf.feature_column.embedding_column(
    education_hash_column, dimension=10
)
marital_embedded_column = tf.feature_column.embedding_column(
    marital_hash_column, dimension=4
)
education_embedded = tf.keras.layers.DenseFeatures([education_embedded_column])(input_layers)
marital_embedded = tf.keras.layers.DenseFeatures([marital_embedded_column])(input_layers)
```

#### 2. Customize a transformation function in `tf.feature_column.numeric_column` to convert category values to IDs, like:

```python
def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Hashing works on strings, so stringify other dtypes first.
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    # normalizer_fn is applied to the raw feature tensor of this column.
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )
```
The output of `tf.feature_column.numeric_column` is a dense tensor of IDs, which can be used directly in `tf.keras.layers.DenseFeatures`. To embed those IDs, we can apply `tf.keras.layers.Embedding` after `DenseFeatures`. Alternatively, we can convert values to IDs with a custom Keras layer.
 
The model example:

```python
# DenseFeatures expects a dict that maps feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(
        name="education", shape=(1,), dtype=tf.string
    ),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=100
)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=50
)

education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)

# The embedding input size must match the hash bucket size.
education_embedded = tf.keras.layers.Embedding(100, 8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50, 4)(marital_hash_ids)
```

#### 3. Customize a Keras layer to convert category values to IDs.

```python
class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Hashing works on strings, so stringify other dtypes first.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)
```
The output is also a dense tensor, so we don't need `tf.keras.layers.DenseFeatures`. Keras is also developing [preprocessing layers](https://github.com/tensorflow/community/pull/188) for categorical features, and the APIs are planned for release in TF 2.2. Once they are available, we can use the Keras preprocessing layers to replace the custom layer.

The model example:

```python
education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)

education_hash_ids = HashBucket(hash_bucket_size=100)(education_input)
marital_hash_ids = HashBucket(hash_bucket_size=50)(marital_input)

education_embedded = tf.keras.layers.Embedding(100,8)(education_hash_ids)
marital_embedded = tf.keras.layers.Embedding(50,4)(marital_hash_ids)
```
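As a quick eager-mode check of the hashing op used in proposals 2 and 3, the sketch below applies `tf.strings.to_hash_bucket_fast` to the "education" values from the table and embeds the resulting IDs. It assumes TensorFlow 2.x with eager execution; the variable names are illustrative only, and the concrete ID values depend on TensorFlow's hash function, so only shapes and dtypes are shown.

```python
import tensorflow as tf

# One string value per row, matching shape=(1,) of the input layers above.
education = tf.constant([["Master"], ["Doctor"], ["Bachelor"]])
bucket_size = 100

# Each string is hashed to an int64 ID in [0, bucket_size).
education_ids = tf.strings.to_hash_bucket_fast(education, bucket_size)

# The embedding input size must be at least the number of buckets.
embedding = tf.keras.layers.Embedding(input_dim=bucket_size, output_dim=8)
education_vectors = embedding(education_ids)

print(education_ids.shape, education_ids.dtype)  # (3, 1) int64
print(education_vectors.shape)                   # (3, 1, 8)
```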

