Design SQLFlow syntax extension for data transform. #1664
Comments
Let's take the simple DNN model for the census income dataset from the ElasticDL model zoo as an example.
SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN (
age,
capital_gain,
capital_loss,
hours_per_week,
EMBEDDING(HASH(workclass, 64), 16),
EMBEDDING(HASH(education, 64), 16),
EMBEDDING(HASH(martial_status, 64), 16),
EMBEDDING(HASH(occupation, 64), 16),
EMBEDDING(HASH(relationship, 64), 16),
EMBEDDING(HASH(race, 64), 16),
EMBEDDING(HASH(sex, 64), 16),
EMBEDDING(HASH(native_country, 64), 16)
)
LABEL label
Let's take the wide and deep model for the census income dataset in PR #1671 as an example.
SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN (
SET GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
SET GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(martial_status), APPLY_VOCAB(relationship)) AS group_2,
SET GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,
[EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide_embeddings,
[EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep_embeddings
)
LABEL label
Maybe a nested expression?
SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMNS (
[EMBEDDING(group_1, 1), EMBEDDING(group_2, 1)] AS wide,
[EMBEDDING(group_1, 8), EMBEDDING(group_2, 8), EMBEDDING(group_3, 8)] AS deep
FROM(
GROUP(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1,
GROUP(HASH(education), HASH(occupation), APPLY_VOCAB(martial_status), APPLY_VOCAB(relationship)) AS group_2,
GROUP(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3,
)
)
LABEL label
We can implement LOG_ROUND using BUCKETIZE(x, bucket_boundaries), so LOG_ROUND can be removed. Also, why do we use HASH rather than HASH_BUCKET? The name HASH may be confusing.
After the discussion in the link, we will choose the following syntax design:
SELECT *
FROM census_income
TO TRAIN WideAndDeepClassifier
COLUMN
EMBEDDING(CONCAT(APPLY_VOCAB(workclass), BUCKETIZE(capital_gain, bucket_num=5), BUCKETIZE(capital_loss, bucket_num=5), BUCKETIZE(hours_per_week, bucket_num=6)) AS group_1, 8),
EMBEDDING(CONCAT(HASH(education), HASH(occupation), APPLY_VOCAB(martial_status), APPLY_VOCAB(relationship)) AS group_2, 8),
EMBEDDING(CONCAT(BUCKETIZE(age, bucket_num=5), HASH(native_country), APPLY_VOCAB(race), APPLY_VOCAB(sex)) AS group_3, 8)
FOR deep_embeddings
COLUMN
EMBEDDING(group_1, 1),
EMBEDDING(group_2, 1)
FOR wide_embeddings
LABEL label
For the suggestion LOG_ROUND -> BUCKETIZE, what should the user write in the COLUMN clause to express the logic of
Yes, I think we should expose the bucket boundaries or the bucket number to users. If the user defines the bucket boundaries, we can use them directly. If not, we can infer the boundaries from the bucket number, using either equal-frequency or equal-distance bucketing.
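To make the two inference strategies concrete, here is a minimal Python sketch. The function names (`equal_distance_boundaries`, `equal_frequency_boundaries`, `bucketize`) are hypothetical and not part of SQLFlow or ElasticDL; the sketch also assumes the column values fit in memory, whereas a real implementation would likely compute the statistics with SQL over the table.

```python
import bisect


def equal_distance_boundaries(xs, bucket_num):
    """Split [min, max] into bucket_num intervals of equal width."""
    x_min, x_max = min(xs), max(xs)
    step = (x_max - x_min) / bucket_num
    # bucket_num buckets need bucket_num - 1 interior boundaries.
    return [x_min + step * i for i in range(1, bucket_num)]


def equal_frequency_boundaries(xs, bucket_num):
    """Pick boundaries so each bucket holds roughly the same number of rows."""
    ordered = sorted(xs)
    n = len(ordered)
    return [ordered[(n * i) // bucket_num] for i in range(1, bucket_num)]


def bucketize(x, boundaries):
    """Return the bucket id of x; a value equal to a boundary goes to the upper bucket."""
    return bisect.bisect_right(boundaries, x)


# Hypothetical sample of a numeric column such as hours_per_week.
hours = [20, 25, 38, 40, 40, 40, 45, 50, 60, 80]
print(equal_distance_boundaries(hours, 5))
print(equal_frequency_boundaries(hours, 5))
print(bucketize(42, equal_distance_boundaries(hours, 5)))
```

With skewed data the two strategies can differ a lot: equal-distance bucketing may leave most rows in one bucket, while equal-frequency bucketing may produce duplicate boundaries when many rows share the same value.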
The root of the discussion series is #1670
The following transform functions are commonly used. We can support these in the first stage.
out = (x - x_min) / (x_max - x_min)
out = (x - x_mean) / x_stddev
out_id = Hash(input_feature) % bucket_size
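The three formulas above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the statistics are computed in memory with the population standard deviation, and MD5 is an arbitrary choice of deterministic hash (Python's built-in `hash()` is randomized per process for strings), not necessarily the hash that the generated transform code would use.

```python
import hashlib
from statistics import mean, pstdev


def min_max_normalize(xs):
    """out = (x - x_min) / (x_max - x_min), mapping values into [0, 1]."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    if span == 0:
        # Constant column: avoid division by zero.
        return [0.0 for _ in xs]
    return [(x - x_min) / span for x in xs]


def standardize(xs):
    """out = (x - x_mean) / x_stddev, giving zero mean and unit variance."""
    x_mean = mean(xs)
    x_stddev = pstdev(xs)  # population standard deviation (an assumption)
    if x_stddev == 0:
        return [0.0 for _ in xs]
    return [(x - x_mean) / x_stddev for x in xs]


def hash_bucket(value, bucket_size):
    """out_id = Hash(input_feature) % bucket_size, with a deterministic hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size


print(min_max_normalize([0, 5, 10]))
print(standardize([1, 2, 3]))
print(hash_bucket("Private", 64))
```

Note that the first two transforms need global statistics (min, max, mean, standard deviation) of the column, so they require an analysis pass over the data before training, while the hash transform is stateless.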
There are three options for the style of the generated transform code: