Design SQLFlow syntax extension for data transform. #1664

Closed · brightcoder01 opened this issue Jan 19, 2020 · 7 comments · Fixed by #1725

@brightcoder01 (Collaborator) commented Jan 19, 2020

The root of the discussion series is #1670

The following transform functions are commonly used. We can support these in the first stage.

| Name | Transformation | Statistical Parameter | Input Type | Output Type |
|------|----------------|-----------------------|------------|-------------|
| NORMALIZE(x) | Scale the inputs to the range [0, 1]: out = (x - x_min) / (x_max - x_min) | x_min, x_max | number | float64 |
| STANDARDIZE(x) | Scale the inputs to z-scores by subtracting the mean and dividing by the standard deviation: out = (x - x_mean) / x_stddev | x_mean, x_stddev | number | float64 |
| BUCKETIZE(x, num_buckets, boundaries) | Transform numeric features into categorical ids using a set of thresholds | boundaries | number | int64 |
| HASH_BUCKET(x, hash_bucket_size) | Map the inputs into a finite number of buckets by hashing: out_id = Hash(input_feature) % bucket_size | hash_bucket_size | string, int32, int64 | int64 |
| VOCABULARIZE(x) | Map the inputs to integer ids by looking them up in the vocabulary | vocabulary_list | string, int32, int64 | int64 |
| EMBEDDING(x, dimension) | Map the inputs to embedding vectors | N/A | int32, int64 | float32 |
| CROSS(x1, x2, ..., xn, hash_bucket_size) | Hash(cartesian product of features) % hash_bucket_size | N/A | string, number | int64 |
| CONCAT(x1, x2, ..., xn) | Concatenate multiple tensors representing categorical ids into one tensor | N/A | int32, int64 | int64 |

There are three options for the style of the generated transform code:

  1. Feature Column API, integrated with the model definition via tf.keras.layers.DenseFeatures;
  2. Customized Keras layers provided by ElasticDL, whose functionality should cover all the commonly used feature engineering operations above;
  3. Keras preprocessing layers, which will be ready in TF 2.2.
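
For illustration, here is a minimal sketch of how option 1 could render a few of the transforms above. It is not the actual code_gen output; the statistics (x_min, x_max, the bucket boundaries) are placeholders that a data analysis step would compute.

import tensorflow as tf

x_min, x_max = 17.0, 90.0  # placeholder statistics for the age column

# NORMALIZE(age): scale into [0, 1] with a normalizer_fn.
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - x_min) / (x_max - x_min))

# BUCKETIZE(capital_gain, boundaries): numeric values -> categorical ids.
capital_gain = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("capital_gain"),
    boundaries=[0.0, 1000.0, 10000.0])

# EMBEDDING(HASH_BUCKET(workclass, 64), 16).
workclass_emb = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        "workclass", hash_bucket_size=64),
    dimension=16)

# CROSS(education, occupation, 128): hashed cartesian product.
edu_x_occ = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=128)

# Option 1 integrates the columns with the model via DenseFeatures.
feature_layer = tf.keras.layers.DenseFeatures(
    [age, capital_gain, workclass_emb,
     tf.feature_column.indicator_column(edu_x_occ)])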
brightcoder01 changed the title from "Extend SQLFlow syntax for data transform and generate the transform python code with code_gen." to "Extend SQLFlow syntax for data transform and generate the transform code with code_gen." on Jan 19, 2020
@brightcoder01 (Collaborator, Author) commented Jan 29, 2020

Let's take the simple DNN model for the census income dataset from the ElasticDL model zoo.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN (
    age, 
    capital_gain, 
    capital_loss, 
    hours_per_week, 
    EMBEDDING(HASH(workclass, 64), 16),
    EMBEDDING(HASH(education, 64), 16),
    EMBEDDING(HASH(marital_status, 64), 16),
    EMBEDDING(HASH(occupation, 64), 16),
    EMBEDDING(HASH(relationship, 64), 16),
    EMBEDDING(HASH(race, 64), 16),
    EMBEDDING(HASH(sex, 64), 16),
    EMBEDDING(HASH(native_country, 64), 16)
)
LABEL label
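
A hypothetical rendering of this COLUMN clause with option 1 (Feature Column API plus tf.keras.layers.DenseFeatures) could look as follows; this is a sketch, not the actual code_gen output.

import tensorflow as tf

NUMERIC = ["age", "capital_gain", "capital_loss", "hours_per_week"]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "sex", "native_country"]

# Plain numeric columns pass through unchanged.
columns = [tf.feature_column.numeric_column(name) for name in NUMERIC]
# EMBEDDING(HASH(name, 64), 16) for each categorical column.
columns += [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            name, hash_bucket_size=64),
        dimension=16)
    for name in CATEGORICAL
]
feature_layer = tf.keras.layers.DenseFeatures(columns)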

@brightcoder01 (Collaborator, Author) commented Jan 29, 2020

Let's take the wide-and-deep model for the census income dataset in PR #1671 as an example.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN (
    SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
    SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
    SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide_embeddings,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep_embeddings
)
LABEL label
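
The intent of the two outputs, sketched below with plain Keras (all shapes and layer sizes are assumptions, not the actual WideAndDeepClassifier definition): the 1-dimensional group embeddings form the linear "wide" part, while the 8-dimensional ones feed the "deep" MLP.

import tensorflow as tf

# wide_embeddings: EMBEDDING(group_1, 1) and EMBEDDING(group_2, 1), concatenated.
wide_embeddings = tf.keras.Input(shape=(2,), name="wide_embeddings")
# deep_embeddings: three 8-dimensional group embeddings, concatenated.
deep_embeddings = tf.keras.Input(shape=(24,), name="deep_embeddings")

wide_logit = tf.keras.layers.Dense(1)(wide_embeddings)  # linear "wide" part
hidden = tf.keras.layers.Dense(16, activation="relu")(deep_embeddings)
deep_logit = tf.keras.layers.Dense(1)(hidden)           # "deep" MLP part
prob = tf.keras.layers.Activation("sigmoid")(
    tf.keras.layers.Add()([wide_logit, deep_logit]))
model = tf.keras.Model([wide_embeddings, deep_embeddings], prob)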

@workingloong (Collaborator) commented:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMNS (
SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

[EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide,
[EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep
)
LABEL label

Maybe a nested expression?

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMNS (
    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep
    FROM(
        GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
        GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
        GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3
    )
    )
)
LABEL label

brightcoder01 changed the title from "Extend SQLFlow syntax for data transform and generate the transform code with code_gen." to "Design SQLFlow syntax extension for data transform. Generate the transform code with code_gen." on Jan 30, 2020
brightcoder01 changed the title from "Design SQLFlow syntax extension for data transform. Generate the transform code with code_gen." to "Design SQLFlow syntax extension for data transform." on Jan 31, 2020
@workingloong (Collaborator) commented:

We can implement LOG_ROUND using BUCKETIZE(x, bucket_boundaries), so it can be removed. And why do we use HASH, not HASH_BUCKET? HASH may be confusing.

@brightcoder01 (Collaborator, Author) commented Feb 11, 2020

Let's take the wide-and-deep model for the census income dataset in PR #1671 as an example.
The proposed SQLFlow expression is:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN (
    SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
    SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2,
    SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,

    [EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide_embeddings,
    [EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep_embeddings
)
LABEL label

After the discussion in the linked thread, we will go with the following syntax design:

SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN
    EMBEDDING(CONCAT(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1, 8),
    EMBEDDING(CONCAT(HASH(education), HASH(occupation), APPLY_VOCAB(marital_status), APPLY_VOCAB(relationship)) AS group_2, 8),
    EMBEDDING(CONCAT(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3, 8)
    FOR deep_embeddings
COLUMN
    EMBEDDING(group_1, 1),
    EMBEDDING(group_2, 1)
    FOR wide_embeddings
LABEL label
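
EMBEDDING(CONCAT(...) AS group, dim) can be read as: shift each member's categorical ids into a disjoint id space, concatenate them, then embed the group once. A sketch under that reading (the vocabulary sizes and the mean combiner are assumptions):

import tensorflow as tf

def concat_ids(id_tensors, vocab_sizes):
    # Shift each member's ids by the cumulative vocabulary size so the
    # concatenated ids live in one disjoint id space.
    offset, shifted = 0, []
    for ids, size in zip(id_tensors, vocab_sizes):
        shifted.append(ids + offset)
        offset += size
    return tf.stack(shifted, axis=1), offset  # [batch, members], total vocab size

# Two toy group members with vocabulary sizes 3 and 5, batch size 2.
ids, total_vocab = concat_ids(
    [tf.constant([1, 0]), tf.constant([2, 3])], vocab_sizes=[3, 5])
embedding = tf.keras.layers.Embedding(total_vocab, 8)  # EMBEDDING(group, 8)
group_emb = tf.reduce_mean(embedding(ids), axis=1)     # [batch, 8]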

brightcoder01 self-assigned this on Feb 11, 2020
@brightcoder01 (Collaborator, Author) commented Feb 11, 2020

> We can implement LOG_ROUND using BUCKETIZE(x, bucket_boundaries), so it can be removed. And why do we use HASH, not HASH_BUCKET? HASH may be confusing.

For the suggestion LOG_ROUND -> BUCKETIZE: what should the user write in the COLUMN clause to express the logic of ROUND(LOG(x))? Should the user write the bucket boundaries explicitly in the COLUMN clause?
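
For reference: ROUND(LOG(x)) == k exactly when x lies in [e^(k-0.5), e^(k+0.5)), so expressing it as BUCKETIZE means supplying boundaries of the form e^(k+0.5). A sketch, where the range of k is an assumption about the column's values:

import math

# With these boundaries, bucket id k equals round(log(x)) for k in 1..9;
# bucket 0 collects everything below e**0.5.
boundaries = [math.exp(k + 0.5) for k in range(0, 10)]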
Renamed HASH to HASH_BUCKET.

@workingloong (Collaborator) commented:

> Should the user write the bucket boundaries explicitly in the COLUMN clause?

Yes, I think we should expose the bucket boundaries or bucket number to users. We can use the boundaries directly if the user defines them; if not, we can infer the boundaries from the bucket number using equal frequency or equal distance.
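
A sketch of the two inference strategies (the function names are illustrative, not SQLFlow APIs): equal frequency derives the boundaries from quantiles, equal distance splits the value range evenly.

import numpy as np

def equal_frequency_boundaries(values, bucket_num):
    # Interior quantiles: bucket_num buckets need bucket_num - 1 boundaries.
    percentiles = np.linspace(0, 100, bucket_num + 1)[1:-1]
    return np.percentile(values, percentiles).tolist()

def equal_distance_boundaries(values, bucket_num):
    lo, hi = np.min(values), np.max(values)
    return np.linspace(lo, hi, bucket_num + 1)[1:-1].tolist()

ages = np.array([19, 23, 31, 38, 42, 47, 55, 61, 70, 78])
print(equal_frequency_boundaries(ages, 5))  # 4 boundaries at the quintiles
print(equal_distance_boundaries(ages, 5))   # 4 evenly spaced boundaries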
