Modify RowConstructor to work with WithColumn #214

Merged on Sep 9, 2019 (17 commits)

suhsteve (Member) commented Aug 20, 2019

This change allows Rows that contain nested StructTypes to be unpickled correctly, and allows a Row to be passed to a UDF through the WithColumn method.

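            // Load the sample data with an explicit schema.
            _df = _spark
                .Read()
                .Schema("age INT, name STRING")
                .Json($"{TestEnvironment.ResourceDirectory}people.json");

            // UDF that takes a whole Row and reports whether its "age" field is null.
            Func<Column, Column> AgeIsNull = Udf<Row, bool>(
                r =>
                {
                    var age = r.GetAs<int?>("age");
                    return !age.HasValue;
                });

            // Pack every existing column into a single struct column ("DummyCol"),
            // then pass that struct to the UDF, where it arrives as a Row.
            string[] allCols = _df.Columns().ToArray();
            DataFrame dummyColDF =
                _df.WithColumn("DummyCol", Struct(allCols[0], allCols.Skip(1).ToArray()));
            DataFrame ageIsNullDF =
                dummyColDF.WithColumn("AgeIsNull", AgeIsNull(dummyColDF["DummyCol"]));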

This produces the following schema for the ageIsNullDF DataFrame:

{
    "type": "struct",
    "fields": [
        {
            "name": "age",
            "type": "integer",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "name",
            "type": "string",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "DummyCol",
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "name": "age",
                        "type": "integer",
                        "nullable": true,
                        "metadata": {}
                    },
                    {
                        "name": "name",
                        "type": "string",
                        "nullable": true,
                        "metadata": {}
                    }
                ]
            },
            "nullable": false,
            "metadata": {}
        },
        {
            "name": "AgeIsNull",
            "type": "boolean",
            "nullable": true,
            "metadata": {}
        }
    ]
}
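
To illustrate what the nested DummyCol struct in this schema implies for unpickling, here is a toy sketch in plain C#. It is not the PR's implementation and does not use Microsoft.Spark: ToyField, ToyStruct, ToyRow and ToyRowBuilder are hypothetical stand-ins, and the sketch assumes a nested struct value arrives as a raw object[] that has to be wrapped in a row of its own, guided by the nested schema.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; NOT the Microsoft.Spark types.
public sealed class ToyField
{
    public string Name { get; }
    public ToyStruct Nested { get; }   // null when the field is a primitive

    public ToyField(string name, ToyStruct nested = null)
    {
        Name = name;
        Nested = nested;
    }
}

public sealed class ToyStruct
{
    public IReadOnlyList<ToyField> Fields { get; }

    public ToyStruct(params ToyField[] fields)
    {
        Fields = fields;
    }
}

public sealed class ToyRow
{
    private readonly ToyStruct _schema;
    private readonly object[] _values;

    public ToyRow(ToyStruct schema, object[] values)
    {
        _schema = schema;
        _values = values;
    }

    // Look a value up by field name and cast it to the requested type.
    public T GetAs<T>(string name)
    {
        for (int i = 0; i < _schema.Fields.Count; i++)
        {
            if (_schema.Fields[i].Name == name)
            {
                return (T)_values[i];
            }
        }

        throw new ArgumentException($"Unknown field: {name}");
    }
}

public static class ToyRowBuilder
{
    // Recursively wrap raw value arrays in ToyRow objects wherever the
    // schema declares a nested struct, so inner fields stay addressable by name.
    public static ToyRow Build(ToyStruct schema, object[] raw)
    {
        if (raw.Length != schema.Fields.Count)
        {
            throw new ArgumentException("Value count does not match schema.");
        }

        var values = new object[raw.Length];
        for (int i = 0; i < raw.Length; i++)
        {
            ToyStruct nested = schema.Fields[i].Nested;
            values[i] = nested != null && raw[i] is object[] inner
                ? Build(nested, inner)
                : raw[i];
        }

        return new ToyRow(schema, values);
    }
}
```

With a schema shaped like the one above (age, name, and a DummyCol struct holding age and name again), ToyRowBuilder.Build returns a row whose DummyCol value is itself a ToyRow, so a callback can call GetAs<int?>("age") on the inner row in the same way the UDF in the description does.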

Benchmark:
HDInsight
D14 v2: 16 CPUs, 112 GB RAM
Spark 2.3.2
TPC-H query 1
Dataset size: 2.8 GB Parquet (generated with db-gen at scale factor 9)
master -> 72fdf4834d5d9e7df06efec29233734aaec9445b
PR -> 882cdbe9ac8f657163a6777a72577db45f1b0154

  • local[14]

    Iteration   master (ms)   PR (ms)
    1           90607         83569
    2           93520         88445
    3           95719         91437
    4           89949         78545
    5           94954         85341
    6           96525         77718
    7           98416         87040
    8           72576         69994
    9           78562         77574
    10          84906         89991
    11          77176         79207
    12          97384         72655
    13          99845         81865
    14          103916        87758
    15          96931         92020
    16          98231         102719
    17          101917        100198
    18          103467        105485
    19          83429         93286
    Avg         92527.89      86570.89
  • local[7]

    Iteration   master (ms)   PR (ms)
    1           43714         41642
    2           44725         40108
    3           47259         41202
    4           36858         37197
    5           37770         34779
    6           36003         37000
    7           35127         37738
    8           38190         41163
    9           43216         45113
    10          36987         39170
    11          41157         38587
    12          41458         39658
    13          41671         40793
    14          39415         43389
    15          41639         42874
    16          42248         42231
    17          48151         48480
    18          48973         39301
    19          39647         44853
    Avg         41274.10      40804.10
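
The averages in the two tables can be reproduced, and turned into a relative difference, with a few lines of plain C#. This is only a sketch: BenchmarkSummary and its members are hypothetical names introduced here, not part of the PR or of any benchmark harness, and the arrays are simply the per-iteration times copied from the tables.

```csharp
using System;
using System.Linq;

// Hypothetical helper for summarizing the timings in the tables above.
public static class BenchmarkSummary
{
    // Per-iteration times in milliseconds, copied from the local[14] table.
    public static readonly int[] Local14Master =
    {
        90607, 93520, 95719, 89949, 94954, 96525, 98416, 72576, 78562, 84906,
        77176, 97384, 99845, 103916, 96931, 98231, 101917, 103467, 83429
    };

    public static readonly int[] Local14PR =
    {
        83569, 88445, 91437, 78545, 85341, 77718, 87040, 69994, 77574, 89991,
        79207, 72655, 81865, 87758, 92020, 102719, 100198, 105485, 93286
    };

    // Per-iteration times in milliseconds, copied from the local[7] table.
    public static readonly int[] Local7Master =
    {
        43714, 44725, 47259, 36858, 37770, 36003, 35127, 38190, 43216, 36987,
        41157, 41458, 41671, 39415, 41639, 42248, 48151, 48973, 39647
    };

    public static readonly int[] Local7PR =
    {
        41642, 40108, 41202, 37197, 34779, 37000, 37738, 41163, 45113, 39170,
        38587, 39658, 40793, 43389, 42874, 42231, 48480, 39301, 44853
    };

    // Arithmetic mean of the iteration times.
    public static double Mean(int[] timesMs) => timesMs.Average();

    // How much lower the candidate's mean is than the baseline's, in percent.
    // A negative result means the candidate was slower on average.
    public static double PercentFaster(int[] baselineMs, int[] candidateMs)
    {
        double baseline = Mean(baselineMs);
        return (baseline - Mean(candidateMs)) / baseline * 100.0;
    }
}
```

BenchmarkSummary.PercentFaster(Local14Master, Local14PR) comes out at roughly 6.4%, and the local[7] pair at roughly 1.1%. Individual iterations vary widely in both configurations, so these figures are a rough indication rather than a statistical comparison.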

imback82 (Contributor) commented

Can you check the test failures?

imback82 added the enhancement (New feature or request) label on Aug 26, 2019
suhsteve force-pushed the stsuh/rowconstructor branch from 2a5d462 to 4cb15f1 on August 30, 2019 22:45
imback82 (Contributor) previously approved these changes on Sep 9, 2019

LGTM. Great work!

imback82 merged commit e1d2db0 into dotnet:master on Sep 9, 2019