Design SQLFlow syntax extension for data transform. #1664

Closed · brightcoder01 opened this issue Jan 19, 2020 · 7 comments · Fixed by #1725

@brightcoder01 (Collaborator) commented Jan 19, 2020

The root of the discussion series is #1670

The following transform functions are commonly used. We can support these in the first stage.

| Name | Transformation | Statistical Parameter | Input Type | Output Type |
|------|----------------|-----------------------|------------|-------------|
| NORMALIZE(x) | Scale the inputs to the range [0, 1]: out = (x - x_min) / (x_max - x_min) | x_min, x_max | number | float64 |
| STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation: out = (x - x_mean) / x_stddev | x_mean, x_stddev | number | float64 |
| BUCKETIZE(x, num_buckets, boundaries) | Transform numeric features into categorical ids using a set of thresholds | boundaries | number | int64 |
| HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing: out_id = Hash(input_feature) % bucket_size | hash_bucket_size | string, int32, int64 | int64 |
| VOCABULARIZE(x) | Map the inputs to integer ids by looking them up in the vocabulary | vocabulary_list | string, int32, int64 | int64 |
| EMBEDDING(x, dimension) | Map the inputs to embedding vectors | N/A | int32, int64 | float32 |
| CROSS(x1, x2, ..., xn, hash_bucket_size) | Hash(cartesian product of features) % hash_bucket_size | N/A | string, number | int64 |
| CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor | N/A | int32, int64 | int64 |

There are three options for the style of the generated transform code:

  1. Feature Column API, integrated with the model definition via tf.keras.layers.DenseFeatures;
  2. Customized Keras layers provided by ElasticDL, whose functionality should cover all the commonly used feature engineering operations above;
  3. Keras preprocessing layers, which will be ready in TF 2.2.
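
For illustration, here is a minimal sketch of how option 1 could render a few of the transforms above. It is not the actual code_gen output; the statistics (x_min, x_max, the bucket boundaries) are placeholders that a data analysis step would compute.

import tensorflow as tf

x_min, x_max = 17.0, 90.0  # placeholder statistics for the age column

# NORMALIZE(age): scale into [0, 1] with a normalizer_fn.
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - x_min) / (x_max - x_min))

# BUCKETIZE(capital_gain, boundaries): numeric values -> categorical ids.
capital_gain = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("capital_gain"),
    boundaries=[0.0, 1000.0, 10000.0])

# EMBEDDING(HASH_BUCKET(workclass, 64), 16).
workclass_emb = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        "workclass", hash_bucket_size=64),
    dimension=16)

# CROSS(education, occupation, 128): hashed cartesian product.
edu_x_occ = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=128)

# Option 1 integrates the columns with the model via DenseFeatures.
feature_layer = tf.keras.layers.DenseFeatures(
    [age, capital_gain, workclass_emb,
     tf.feature_column.indicator_column(edu_x_occ)])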
brightcoder01 changed the title from "Extend SQLFlow syntax for data transform and generate the transform python code with code_gen." to "Extend SQLFlow syntax for data transform and generate the transform code with code_gen." on Jan 19, 2020
@brightcoder01 (Collaborator, Author) commented Jan 29, 2020

Let's take the simple DNN model for the census income dataset from the ElasticDL model zoo.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN (
    age, 
    capital_gain, 
    capital_loss, 
    hours_per_week, 
    EMBEDDING(HASH(workclass, 64), 16),
    EMBEDDING(HASH(education, 64), 16),
    EMBEDDING(HASH(marital_status, 64), 16),
    EMBEDDING(HASH(occupation, 64), 16),
    EMBEDDING(HASH(relationship, 64), 16),
    EMBEDDING(HASH(race, 64), 16),
    EMBEDDING(HASH(sex, 64), 16),
    EMBEDDING(HASH(native_country, 64), 16)
)
LABEL label
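
A hypothetical rendering of this COLUMN clause with option 1 (Feature Column API plus tf.keras.layers.DenseFeatures) could look as follows; this is a sketch, not the actual code_gen output.

import tensorflow as tf

NUMERIC = ["age", "capital_gain", "capital_loss", "hours_per_week"]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "sex", "native_country"]

# Plain numeric columns pass through unchanged.
columns = [tf.feature_column.numeric_column(name) for name in NUMERIC]
# EMBEDDING(HASH(name, 64), 16) for each categorical column.
columns += [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            name, hash_bucket_size=64),
        dimension=16)
    for name in CATEGORICAL
]
feature_layer = tf.keras.layers.DenseFeatures(columns)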

@brightcoder01 (Collaborator, Author) commented Jan 29, 2020

Let's take the wide-and-deep model for the census income dataset in PR #1671 as an example.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN (
    SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
    SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
    SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide_embeddings,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep_embeddings
)
LABEL label
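
The intent of the two outputs, sketched below with plain Keras (all shapes and layer sizes are assumptions, not the actual WideAndDeepClassifier definition): the 1-dimensional group embeddings form the linear "wide" part, while the 8-dimensional ones feed the "deep" MLP.

import tensorflow as tf

# wide_embeddings: EMBEDDING(group_1, 1) and EMBEDDING(group_2, 1), concatenated.
wide_embeddings = tf.keras.Input(shape=(2,), name="wide_embeddings")
# deep_embeddings: three 8-dimensional group embeddings, concatenated.
deep_embeddings = tf.keras.Input(shape=(24,), name="deep_embeddings")

wide_logit = tf.keras.layers.Dense(1)(wide_embeddings)  # linear "wide" part
hidden = tf.keras.layers.Dense(16, activation="relu")(deep_embeddings)
deep_logit = tf.keras.layers.Dense(1)(hidden)           # "deep" MLP part
prob = tf.keras.layers.Activation("sigmoid")(
    tf.keras.layers.Add()([wide_logit, deep_logit]))
model = tf.keras.Model([wide_embeddings, deep_embeddings], prob)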

@workingloong (Collaborator) commented:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMNS (
SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

[EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide,
[EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep
)
LABEL label

Maybe a nested expression?

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMNS (
    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep
    FROM(
        GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
        GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
        GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3
    )
    )
)
LABEL label

brightcoder01 changed the title from "Extend SQLFlow syntax for data transform and generate the transform code with code_gen." to "Design SQLFlow syntax extension for data transform. Generate the transform code with code_gen." on Jan 30, 2020
brightcoder01 changed the title from "Design SQLFlow syntax extension for data transform. Generate the transform code with code_gen." to "Design SQLFlow syntax extension for data transform." on Jan 31, 2020
@workingloong (Collaborator) commented:

We can implement LOG_ROUND using BUCKETIZE(x, bucket_boundaries), so it can be removed. And why do we use HASH, not HASH_BUCKET? HASH may be confusing.

@brightcoder01 (Collaborator, Author) commented Feb 11, 2020

Let's take the wide-and-deep model for the census income dataset in PR #1671 as an example.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN (
    SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
    SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
    SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide_embeddings,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep_embeddings
)
LABEL label

After the discussion in the linked thread, we will go with the following syntax design:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN
    EMBEDDING(CONCAT(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1, 8),
    EMBEDDING(CONCAT(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2, 8),
    EMBEDDING(CONCAT(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3, 8)
    FOR deep_embeddings
COLUMN
    EMBEDDING(group_1, 1),
    EMBEDDING(group_2, 1)
    FOR wide_embeddings
LABEL label
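
EMBEDDING(CONCAT(...) AS group, dim) can be read as: shift each member's categorical ids into a disjoint id space, concatenate them, then embed the group once. A sketch under that reading (the vocabulary sizes and the mean combiner are assumptions):

import tensorflow as tf

def concat_ids(id_tensors, vocab_sizes):
    # Shift each member's ids by the cumulative vocabulary size so the
    # concatenated ids live in one disjoint id space.
    offset, shifted = 0, []
    for ids, size in zip(id_tensors, vocab_sizes):
        shifted.append(ids + offset)
        offset += size
    return tf.stack(shifted, axis=1), offset  # [batch, members], total vocab size

# Two toy group members with vocabulary sizes 3 and 5, batch size 2.
ids, total_vocab = concat_ids(
    [tf.constant([1, 0]), tf.constant([2, 3])], vocab_sizes=[3, 5])
embedding = tf.keras.layers.Embedding(total_vocab, 8)  # EMBEDDING(group, 8)
group_emb = tf.reduce_mean(embedding(ids), axis=1)     # [batch, 8]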

brightcoder01 self-assigned this on Feb 11, 2020
@brightcoder01 (Collaborator, Author) commented Feb 11, 2020

> We can implement LOG_ROUND using BUCKETIZE(x, bucket_boundaries), so it can be removed. And why do we use HASH, not HASH_BUCKET? HASH may be confusing.

For the suggestion LOG_ROUND -> BUCKETIZE: what should the user write in the COLUMN clause to express the logic of ROUND(LOG(x))? Should the user write the bucket boundaries explicitly in the COLUMN clause?
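
For reference: ROUND(LOG(x)) == k exactly when x lies in [e^(k-0.5), e^(k+0.5)), so expressing it as BUCKETIZE means supplying boundaries of the form e^(k+0.5). A sketch, where the range of k is an assumption about the column's values:

import math

# With these boundaries, bucket id k equals round(log(x)) for k in 1..9;
# bucket 0 collects everything below e**0.5.
boundaries = [math.exp(k + 0.5) for k in range(0, 10)]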
Renamed HASH to HASH_BUCKET.

@workingloong (Collaborator) commented:

> Should the user write the bucket boundaries explicitly in the COLUMN clause?

Yes, I think we should expose the bucket boundaries or bucket number to users. We can use the boundaries directly if the user defines them; if not, we can infer the boundaries from the bucket number using equal frequency or equal distance.
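
A sketch of the two inference strategies (the function names are illustrative, not SQLFlow APIs): equal frequency derives the boundaries from quantiles, equal distance splits the value range evenly.

import numpy as np

def equal_frequency_boundaries(values, bucket_num):
    # Interior quantiles: bucket_num buckets need bucket_num - 1 boundaries.
    percentiles = np.linspace(0, 100, bucket_num + 1)[1:-1]
    return np.percentile(values, percentiles).tolist()

def equal_distance_boundaries(values, bucket_num):
    lo, hi = np.min(values), np.max(values)
    return np.linspace(lo, hi, bucket_num + 1)[1:-1].tolist()

ages = np.array([19, 23, 31, 38, 42, 47, 55, 61, 70, 78])
print(equal_frequency_boundaries(ages, 5))  # 4 boundaries at the quintiles
print(equal_distance_boundaries(ages, 5))   # 4 evenly spaced boundaries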
