From 03b79393e71910a33a39864e563fcbeb2de56658 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Sun, 19 Apr 2020 22:31:05 -0700
Subject: [PATCH 01/18] Adding section for UDF serialization

---
 docs/broadcast-guide.md |  92 +++++++++++++++++++++
 docs/udf-guide.md       | 172 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/broadcast-guide.md
 create mode 100644 docs/udf-guide.md

diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md
new file mode 100644
index 000000000..4286c569e
--- /dev/null
+++ b/docs/broadcast-guide.md
@@ -0,0 +1,92 @@
# Guide to using Broadcast Variables

This is a guide to show how to use broadcast variables in .NET for Apache Spark.

## What are Broadcast Variables

[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing read-only variables across executors. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

### How to use broadcast variables in .NET for Apache Spark

Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method on it.

Example:

```csharp
string v = "Variable to be broadcasted";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf = Udf<string, string>(
    str => $"{str}: {bv.Value()}");
```

The type of the broadcast variable is captured by using generics in C#, as can be seen in the above example.

### Deleting broadcast variables

The broadcast variable can be deleted from all executors by calling the `Destroy()` function on it.
```csharp
// Destroying the broadcast variable bv:
bv.Destroy();
```

> Note: `Destroy` deletes all data and metadata related to the broadcast variable. Use this with caution: once a broadcast variable has been destroyed, it cannot be used again.

#### Caveat of using Destroy

One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of each variable to only the UDF that references it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on a broadcast variable: if the destroyed broadcast variable is visible to or accessible from other UDFs, it gets picked up for serialization by all of those UDFs, even if they never reference it. This throws an error, because .NET for Apache Spark is not able to serialize the destroyed broadcast variable.

Example to demonstrate:

```csharp
string v = "Variable to be broadcasted";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf1 = Udf<string, string>(
    str => $"{str}: {bv.Value()}");

// Destroying bv
bv.Destroy();

// Calling udf1 after destroying bv throws the following expected exception:
// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed
df.Select(udf1(df["_1"])).Show();

// Different UDF udf2 that is not referencing bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 throws the following (unexpected) exception:
// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable
df.Select(udf2(df["_1"])).Show();
```

The recommended way of implementing the desired behavior above:

```csharp
string v = "Variable to be broadcasted";
// Restricting the visibility of bv to only the UDF referencing it
{
    Broadcast<string> bv = SparkContext.Broadcast(v);

    // Using the broadcast variable in a UDF:
    Func<Column, Column> udf1 = Udf<string, string>(
        str => $"{str}: {bv.Value()}");

    // Destroying bv
    bv.Destroy();
}

// Different UDF udf2 that is not referencing bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 works fine as expected
df.Select(udf2(df["_1"])).Show();
```

This ensures that destroying `bv` does not affect calling `udf2` through unexpected serialization behavior.

Broadcast variables are very useful for transmitting read-only data to all executors: the data is sent only once, which gives a huge performance benefit compared with local variables that are shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) for a deeper understanding of broadcast variables and why they are used.
\ No newline at end of file
diff --git a/docs/udf-guide.md b/docs/udf-guide.md
new file mode 100644
index 000000000..bb308815d
--- /dev/null
+++ b/docs/udf-guide.md
@@ -0,0 +1,172 @@
# Guide to User-Defined Functions (UDFs)

This is a guide to show how to use UDFs in .NET for Apache Spark.

## What are UDFs

[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allows developers to extend the system's built-in functionality with custom functions. A UDF transforms values from a single row within a table to produce a single corresponding output value per row, based on the logic defined in the UDF.
+ +Let's take the following as an example for a UDF definition: + +```csharp +string s1 = "hello"; +Func udf = Udf( + str => $"{s1} {str}"); + +``` +The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` appended in front of the input. + +For a sample Dataframe, let's take the following Dataframe `df`: + +```text ++-------+ +| name| ++-------+ +|Michael| +| Andy| +| Justin| ++-------+ +``` + +Now let's apply the above defined `udf` to the dataframe `df`: + +```csharp +DataFrame udfResult = df.Select(udf(df["name"])); +``` + +This would return the below as the Dataframe `udfResult`: + +```text ++-------------+ +| name| ++-------------+ +|hello Michael| +| hello Andy| +| hello Justin| ++-------------+ +``` +To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49). + +## UDF serialization + +Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/) which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8) which is the class instance on which the current delegate invokes the instance method. 
Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is being done. + +## Good to know while implementing UDFs + +One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so it is instead done by using reflection to serialize the target where the delegate is defined. When multiple delegates are defined in a common scope, they have a shared closure that becomes the target of reflection for serialization. Let's take an example to illustrate what that means. + +The following code snippet defines two string variables that are being referenced in two function delegates, that just return the respective strings as result: + +```csharp +using System; + +public class C { + public void M() { + string s1 = "s1"; + string s2 = "s2"; + Func a = str => s1; + Func b = str => s2; + } +} +``` + +The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: + +```csharp +public class C +{ + [CompilerGenerated] + private sealed class <>c__DisplayClass0_0 + { + public string s1; + + public string s2; + + internal string b__0(string str) + { + return s1; + } + + internal string b__1(string str) + { + return s2; + } + } + + public void M() + { + <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); + <>c__DisplayClass0_.s1 = "s1"; + <>c__DisplayClass0_.s2 = "s2"; + Func func = new Func(<>c__DisplayClass0_.b__0); + Func func2 = new Func(<>c__DisplayClass0_.b__1); + } +} +``` +As can be seen in the above IL code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. 
Hence, even though `Func a` is only referencing `s1`, `s2` also gets serialized when sending over the bytes to the workers. + +This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope. +Taking the above example to better explain what that means: + +Recommended user code to implement desired behavior of previous code snippet: + +```csharp +using System; + +public class C { + public void M() { + { + string s1 = "s1"; + Func a = str => s1; + } + { + string s2 = "s2"; + Func b = str => s2; + } + } +} +``` + +The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: + +```csharp +public class C +{ + [CompilerGenerated] + private sealed class <>c__DisplayClass0_0 + { + public string s1; + + internal string b__0(string str) + { + return s1; + } + } + + [CompilerGenerated] + private sealed class <>c__DisplayClass0_1 + { + public string s2; + + internal string b__1(string str) + { + return s2; + } + } + + public void M() + { + <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); + <>c__DisplayClass0_.s1 = "s1"; + Func func = new Func(<>c__DisplayClass0_.b__0); + <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1(); + <>c__DisplayClass0_2.s2 = "s2"; + Func func2 = new Func(<>c__DisplayClass0_2.b__1); + } +} +``` + +Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate. + +This above behavior is important to keep in mind while implementing multiple UDFs in a common scope. 
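To make the shared-closure behavior above concrete, here is a small, self-contained C# sketch (an illustration, not part of the original guide) that inspects `Delegate.Target` directly. Two lambdas defined in the same scope share one compiler-generated closure object, while a delegate over a static method has no target at all:

```csharp
using System;

public class ClosureDemo
{
    // A static method captures no enclosing state.
    public static string Identity(string str) => str;

    public static void Main()
    {
        string s1 = "s1";
        string s2 = "s2";

        // Both lambdas capture locals from the same scope, so the compiler
        // hoists s1 and s2 into a single display class; each delegate's
        // Target is that same shared closure object.
        Func<string, string> a = str => s1;
        Func<string, string> b = str => s2;
        Console.WriteLine(ReferenceEquals(a.Target, b.Target)); // True

        // A delegate over a static method has a null Target, so there is
        // no closure object to drag extra state into serialization.
        Func<string, string> c = Identity;
        Console.WriteLine(c.Target is null); // True
    }
}
```

Checking `Target` this way can help confirm whether a delegate is carrying a shared closure before it is handed to a UDF helper.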
+To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d). \ No newline at end of file From 4ef693dbf7616b738a6ae70d1e9dc8c12dd8e5d3 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Sun, 19 Apr 2020 22:32:56 -0700 Subject: [PATCH 02/18] removing guides from master --- docs/broadcast-guide.md | 92 --------------------- docs/udf-guide.md | 172 ---------------------------------------- 2 files changed, 264 deletions(-) delete mode 100644 docs/broadcast-guide.md delete mode 100644 docs/udf-guide.md diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md deleted file mode 100644 index 4286c569e..000000000 --- a/docs/broadcast-guide.md +++ /dev/null @@ -1,92 +0,0 @@ -# Guide to using Broadcast Variables - -This is a guide to show how to use broadcast variables in .NET for Apache Spark. - -## What are Broadcast Variables - -[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing variables across executors that are meant to be read-only. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. - -### How to use broadcast variables in .NET for Apache Spark - -Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method on it. 
- -Example: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf = Udf( - str => $"{str}: {bv.Value()}"); -``` - -The type of broadcast variable is captured by using Generics in C#, as can be seen in the above example. - -### Deleting broadcast variables - -The broadcast variable can be deleted from all executors by calling the `Destroy()` function on it. - -```csharp -// Destroying the broadcast variable bv: -bv.Destroy(); -``` - -> Note: `Destroy` deletes all data and metadata related to the broadcast variable. Use this with caution- once a broadcast variable has been destroyed, it cannot be used again. - -#### Caveat of using Destroy - -One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that is referencing it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the broadcast variable that has been destroyed is visible to or accessible from other UDFs, it gets picked up for serialization by all those UDFs, even if it is not being referenced by them. This will throw an error as .NET for Apache Spark is not able to serialize the destroyed broadcast variable. 
- -Example to demonstrate: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - -// Destroying bv -bv.Destroy(); - -// Calling udf1 after destroying bv throws the following expected exception: -// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed -df.Select(udf1(df["_1"])).Show(); - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 throws the following (unexpected) exception: -// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable -df.Select(udf2(df["_1"])).Show(); -``` - -The recommended way of implementing above desired behavior: - -```csharp -string v = "Variable to be broadcasted"; -// Restricting the visibility of bv to only the UDF referencing it -{ - Broadcast bv = SparkContext.Broadcast(v); - - // Using the broadcast variable in a UDF: - Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - - // Destroying bv - bv.Destroy(); -} - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 works fine as expected -df.Select(udf2(df["_1"])).Show(); -``` - This ensures that destroying `bv` doesn't affect calling `udf2` because of unexpected serialization behavior. - - Broadcast variables are very useful for transmitting read-only data to all executors, as the data is sent only once and this gives huge performance benefits when compared with using local variables that get shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used. 
\ No newline at end of file diff --git a/docs/udf-guide.md b/docs/udf-guide.md deleted file mode 100644 index bb308815d..000000000 --- a/docs/udf-guide.md +++ /dev/null @@ -1,172 +0,0 @@ -# Guide to User-Defined Functions (UDFs) - -This is a guide to show how to use UDFs in .NET for Apache Spark. - -## What are UDFs - -[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allow developers to use custom functions to extend the system's built-in functionality. They transform values from a single row within a table to produce a single corresponding output value per row based on the logic defined in the UDF. - -Let's take the following as an example for a UDF definition: - -```csharp -string s1 = "hello"; -Func udf = Udf( - str => $"{s1} {str}"); - -``` -The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` appended in front of the input. 
- -For a sample Dataframe, let's take the following Dataframe `df`: - -```text -+-------+ -| name| -+-------+ -|Michael| -| Andy| -| Justin| -+-------+ -``` - -Now let's apply the above defined `udf` to the dataframe `df`: - -```csharp -DataFrame udfResult = df.Select(udf(df["name"])); -``` - -This would return the below as the Dataframe `udfResult`: - -```text -+-------------+ -| name| -+-------------+ -|hello Michael| -| hello Andy| -| hello Justin| -+-------------+ -``` -To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49). - -## UDF serialization - -Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/) which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8) which is the class instance on which the current delegate invokes the instance method. Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is being done. - -## Good to know while implementing UDFs - -One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so it is instead done by using reflection to serialize the target where the delegate is defined. 
When multiple delegates are defined in a common scope, they have a shared closure that becomes the target of reflection for serialization. Let's take an example to illustrate what that means. - -The following code snippet defines two string variables that are being referenced in two function delegates, that just return the respective strings as result: - -```csharp -using System; - -public class C { - public void M() { - string s1 = "s1"; - string s2 = "s2"; - Func a = str => s1; - Func b = str => s2; - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - public string s2; - - internal string b__0(string str) - { - return s1; - } - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - <>c__DisplayClass0_.s2 = "s2"; - Func func = new Func(<>c__DisplayClass0_.b__0); - Func func2 = new Func(<>c__DisplayClass0_.b__1); - } -} -``` -As can be seen in the above IL code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. Hence, even though `Func a` is only referencing `s1`, `s2` also gets serialized when sending over the bytes to the workers. - -This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope. 
-Taking the above example to better explain what that means: - -Recommended user code to implement desired behavior of previous code snippet: - -```csharp -using System; - -public class C { - public void M() { - { - string s1 = "s1"; - Func a = str => s1; - } - { - string s2 = "s2"; - Func b = str => s2; - } - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - internal string b__0(string str) - { - return s1; - } - } - - [CompilerGenerated] - private sealed class <>c__DisplayClass0_1 - { - public string s2; - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - Func func = new Func(<>c__DisplayClass0_.b__0); - <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1(); - <>c__DisplayClass0_2.s2 = "s2"; - Func func2 = new Func(<>c__DisplayClass0_2.b__1); - } -} -``` - -Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate. - -This above behavior is important to keep in mind while implementing multiple UDFs in a common scope. -To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d). 
\ No newline at end of file From 49a437d79701d8e1abf0d9ec80d873e33e912592 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 10:39:32 -0700 Subject: [PATCH 03/18] Prep changes for 0.11.0 version --- README.md | 2 +- benchmark/scala/pom.xml | 2 +- docs/release-notes/0.11/release-0.11.md | 55 +++++++++++++++++++++++++ eng/Versions.props | 2 +- src/scala/pom.xml | 2 +- 5 files changed, 59 insertions(+), 4 deletions(-) create mode 100644 docs/release-notes/0.11/release-0.11.md diff --git a/README.md b/README.md index 299e1c94e..2d5638a97 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ 2.3.* - v0.10.0 + v0.11.0 2.4.0 diff --git a/benchmark/scala/pom.xml b/benchmark/scala/pom.xml index dcedfa5a9..56b8dc1ea 100644 --- a/benchmark/scala/pom.xml +++ b/benchmark/scala/pom.xml @@ -3,7 +3,7 @@ 4.0.0 com.microsoft.spark microsoft-spark-benchmark - 0.10.0 + 0.11.0 2019 UTF-8 diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md new file mode 100644 index 000000000..e80ace776 --- /dev/null +++ b/docs/release-notes/0.11/release-0.11.md @@ -0,0 +1,55 @@ +# .NET for Apache Spark 0.11 Release Notes + +### New Features and Improvements + +* Streamline logging when there is a failure ([#439](https://github.com/dotnet/spark/pull/439)) +* Ability to pass and return corefxlab DataFrames to UDF APIs ([#277](https://github.com/dotnet/spark/pull/277)) +* Refactor the DataFrame APIs to use the latest versions where possible ([#452](https://github.com/dotnet/spark/pull/452)) +* Supporting ML TF-IDF (Term frequency-inverse document frequency) feature vectorization method ([#394](https://github.com/dotnet/spark/pull/394)) +* Support for TimestampType in `DataFrame.Collect()`, `CreateDataFrame` and UDFs ([#428](https://github.com/dotnet/spark/pull/428)) +* Support for Broadcast Variables 
([#414](https://github.com/dotnet/spark/pull/414))
+* Implement ML feature Word2Vec ([#491](https://github.com/dotnet/spark/pull/491))
+
+
+### Breaking Changes
+
+* None
+
+### Supported Spark Versions
+
+The following table outlines the supported Spark versions along with the microsoft-spark JAR to use with:
+
| Spark Version                     | microsoft-spark JAR              |
|-----------------------------------|----------------------------------|
| 2.3.*                             | microsoft-spark-2.3.x-0.11.0.jar |
| 2.4.0, 2.4.1, 2.4.3, 2.4.4, 2.4.5 | microsoft-spark-2.4.x-0.11.0.jar |
| 2.4.2                             | Not supported                    |
diff --git a/eng/Versions.props b/eng/Versions.props index f506459e1..18d664dda 100644 --- a/eng/Versions.props +++ b/eng/Versions.props @@ -1,7 +1,7 @@ - 0.10.0 + 0.11.0 prerelease $(RestoreSources); diff --git a/src/scala/pom.xml b/src/scala/pom.xml index ddcd7645f..34ee5c338 100644 --- a/src/scala/pom.xml +++ b/src/scala/pom.xml @@ -7,7 +7,7 @@ ${microsoft-spark.version} UTF-8 - 0.10.0 + 0.11.0 From bde723130e34fe4aed1f6bafa3d805d5c27fffd6 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 12:19:10 -0700 Subject: [PATCH 04/18] Adding table for breaking changes --- docs/release-notes/0.11/release-0.11.md | 33 ++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md index e80ace776..730518f78 100644 --- a/docs/release-notes/0.11/release-0.11.md +++ b/docs/release-notes/0.11/release-0.11.md @@ -2,18 +2,45 @@ ### New Features and Improvements -* Streamline logging when there is a failure ([#439](https://github.com/dotnet/spark/pull/439)) * Ability to pass and return corefxlab DataFrames to UDF APIs ([#277](https://github.com/dotnet/spark/pull/277)) -* Refactor the DataFrame APIs to use the latest versions where possible ([#452](https://github.com/dotnet/spark/pull/452)) * Supporting ML TF-IDF (Term frequency-inverse document frequency) feature vectorization method ([#394](https://github.com/dotnet/spark/pull/394)) * Support for TimestampType in `DataFrame.Collect()`, `CreateDataFrame` and UDFs ([#428](https://github.com/dotnet/spark/pull/428)) * Support for Broadcast Variables ([#414](https://github.com/dotnet/spark/pull/414)) * Implement ML feature Word2Vec ([#491](https://github.com/dotnet/spark/pull/491)) +* Streamline logging when there is a failure 
([#439](https://github.com/dotnet/spark/pull/439))
 
 ### Breaking Changes
 
-* None
+* SparkSession.Catalog call changed from a method to a property ([#508](https://github.com/dotnet/spark/pull/508))
+
| Oldest compatible .NET for Apache Spark version | Incompatible features            |
|-------------------------------------------------|----------------------------------|
| v0.9.0                                          | microsoft-spark-2.3.x-0.11.0.jar |
|                                                 | DataFrame with Grouped Map UDF   |
|                                                 | DataFrame with Vector UDF        |
|                                                 | Support for Broadcast Variables  |
|                                                 | Support for TimestampType        |
+ ### Supported Spark Versions From 598ea15248809759f864bb15dc6980c0e68b3715 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 12:21:27 -0700 Subject: [PATCH 05/18] table formatting error --- docs/release-notes/0.11/release-0.11.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md index 730518f78..6c1e46f16 100644 --- a/docs/release-notes/0.11/release-0.11.md +++ b/docs/release-notes/0.11/release-0.11.md @@ -24,9 +24,6 @@ v0.9.0 - microsoft-spark-2.3.x-0.11.0.jar - - DataFrame with Grouped Map UDF From 4ddfb2c87c4049aa776bf816f0138048c6525175 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 12:43:49 -0700 Subject: [PATCH 06/18] Added Compatibility section --- docs/release-notes/0.11/release-0.11.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md index 6c1e46f16..e75f4a69f 100644 --- a/docs/release-notes/0.11/release-0.11.md +++ b/docs/release-notes/0.11/release-0.11.md @@ -14,10 +14,12 @@ * SparkSession.Catalog call changed from a method to a property ([#508](https://github.com/dotnet/spark/pull/508)) +### Compatibility + - + From d748beb6fd075f867f0540e34d05343ad1ebc722 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 12:55:11 -0700 Subject: [PATCH 07/18] Added description for compatibility table --- docs/release-notes/0.11/release-0.11.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md index e75f4a69f..eb2af3507 100644 --- a/docs/release-notes/0.11/release-0.11.md +++ b/docs/release-notes/0.11/release-0.11.md @@ -16,6 +16,8 @@ ### Compatibility +The following table describes the oldest version of the worker that this current version is compatible with, excluding some incompatible features as shown below. +
Oldest compatible .NET for Apache Spark versionOldest compatible .NET for Apache Spark worker Incompatible features
From dfadf09700f496949b3da2e00a53232e764e44c7 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 6 May 2020 13:24:40 -0700 Subject: [PATCH 08/18] Added link to PRs of incompatible features --- docs/release-notes/0.11/release-0.11.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md index eb2af3507..53d266f55 100644 --- a/docs/release-notes/0.11/release-0.11.md +++ b/docs/release-notes/0.11/release-0.11.md @@ -28,16 +28,16 @@ The following table describes the oldest version of the worker that this current - + - + - + - +
v0.9.0DataFrame with Grouped Map UDFDataFrame with Grouped Map UDF ([PR](https://github.com/dotnet/spark/pull/277))
DataFrame with Vector UDFDataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))
Support for Broadcast VariablesSupport for Broadcast Variables ([PR](https://github.com/dotnet/spark/pull/414))
Support for TimestampTypeSupport for TimestampType ([PR](https://github.com/dotnet/spark/pull/428))
From e9eb4c814c0c209fae984cc1e225b04853ae8d79 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 13:27:02 -0700
Subject: [PATCH 09/18] testing remove the link name

---
 docs/release-notes/0.11/release-0.11.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 53d266f55..de80948f3 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -28,7 +28,7 @@ The following table describes the oldest version of the worker that this current
 v0.9.0
- DataFrame with Grouped Map UDF ([PR](https://github.com/dotnet/spark/pull/277))
+ DataFrame with Grouped Map UDF ([#277](https://github.com/dotnet/spark/pull/277))
 DataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))

From d27baaed815fdc886d1e42d8b5825523e29ce384 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 13:29:35 -0700
Subject: [PATCH 10/18] test 2

---
 docs/release-notes/0.11/release-0.11.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index de80948f3..1f160c666 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -28,7 +28,7 @@ The following table describes the oldest version of the worker that this current
 v0.9.0
- DataFrame with Grouped Map UDF ([#277](https://github.com/dotnet/spark/pull/277))
+ DataFrame with Grouped Map UDF [https://github.com/dotnet/spark/pull/277](PR)
 DataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))

From fb625b334d921b107fa9c40439e9e944a30e4157 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 13:33:52 -0700
Subject: [PATCH 11/18] test3

---
 docs/release-notes/0.11/release-0.11.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 1f160c666..65d62f560 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -28,7 +28,7 @@ The following table describes the oldest version of the worker that this current
 v0.9.0
- DataFrame with Grouped Map UDF [https://github.com/dotnet/spark/pull/277](PR)
+ DataFrame with Grouped Map UDF [PR](https://github.com/dotnet/spark/pull/277)
 DataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))

From 0dfe622103189a6250f0d1e55d22b78cd662e378 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 13:50:16 -0700
Subject: [PATCH 12/18] using href tags

---
 docs/release-notes/0.11/release-0.11.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 65d62f560..81169d21e 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -27,8 +27,8 @@ The following table describes the oldest version of the worker that this current
- v0.9.0
- DataFrame with Grouped Map UDF [PR](https://github.com/dotnet/spark/pull/277)
+ v0.9.0
+ DataFrame with Grouped Map UDF PR
 DataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))

From 62f91c4410f17f74956c19c90d8430e6c5d0d11a Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 13:53:06 -0700
Subject: [PATCH 13/18] using href tags to remove link url from table

---
 docs/release-notes/0.11/release-0.11.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 81169d21e..5029a47fe 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -28,16 +28,16 @@ The following table describes the oldest version of the worker that this current
 v0.9.0
- DataFrame with Grouped Map UDF PR
+ DataFrame with Grouped Map UDF (#277)
- DataFrame with Vector UDF ([PR](https://github.com/dotnet/spark/pull/277))
+ DataFrame with Vector UDF (#277)
- Support for Broadcast Variables ([PR](https://github.com/dotnet/spark/pull/414))
+ Support for Broadcast Variables (#414)
- Support for TimestampType ([PR](https://github.com/dotnet/spark/pull/428))
+ Support for TimestampType (#428)

From 63f2cac01389a8fed1d3b8899b62fcda5311af42 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 14:20:28 -0700
Subject: [PATCH 14/18] PR review changes

---
 docs/release-notes/0.11/release-0.11.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 5029a47fe..d037e04c7 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -2,17 +2,17 @@

 ### New Features and Improvements

-* Ability to pass and return corefxlab DataFrames to UDF APIs ([#277](https://github.com/dotnet/spark/pull/277))
+* Ability to pass and return [corefxlab](https://github.com/dotnet/corefxlab) DataFrames to UDF APIs ([#277](https://github.com/dotnet/spark/pull/277))
 * Supporting ML TF-IDF (Term frequency-inverse document frequency) feature vectorization method ([#394](https://github.com/dotnet/spark/pull/394))
 * Support for TimestampType in `DataFrame.Collect()`, `CreateDataFrame` and UDFs ([#428](https://github.com/dotnet/spark/pull/428))
 * Support for Broadcast Variables ([#414](https://github.com/dotnet/spark/pull/414))
-* Implement ML feature Word2Vec ([#491](https://github.com/dotnet/spark/pull/491))
+* Support for ML feature Word2Vec ([#491](https://github.com/dotnet/spark/pull/491))
 * Streamline logging when there is a failure ([#439](https://github.com/dotnet/spark/pull/439))

 ### Breaking Changes

-* SparkSession.Catalog call changed from a method to a property ([#508](https://github.com/dotnet/spark/pull/508))
+* `SparkSession.Catalog` is changed from a method to a property ([#508](https://github.com/dotnet/spark/pull/508))

 ### Compatibility

From e06399aa7aafa42770b8dab579a897206dfd753f Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 14:28:19 -0700
Subject: [PATCH 15/18] PR comment changes

---
 docs/release-notes/0.11/release-0.11.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index d037e04c7..8c1d5354a 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -3,7 +3,7 @@
 ### New Features and Improvements

 * Ability to pass and return [corefxlab](https://github.com/dotnet/corefxlab) DataFrames to UDF APIs ([#277](https://github.com/dotnet/spark/pull/277))
-* Supporting ML TF-IDF (Term frequency-inverse document frequency) feature vectorization method ([#394](https://github.com/dotnet/spark/pull/394))
+* Support for ML TF-IDF (Term frequency-inverse document frequency) feature vectorization method ([#394](https://github.com/dotnet/spark/pull/394))
 * Support for TimestampType in `DataFrame.Collect()`, `CreateDataFrame` and UDFs ([#428](https://github.com/dotnet/spark/pull/428))
 * Support for Broadcast Variables ([#414](https://github.com/dotnet/spark/pull/414))
 * Support for ML feature Word2Vec ([#491](https://github.com/dotnet/spark/pull/491))
@@ -16,7 +16,9 @@

 ### Compatibility

-The following table describes the oldest version of the worker that this current version is compatible with, excluding some incompatible features as shown below.
+#### Backward compatibility
+
+The following table describes the oldest version of the worker that the current version is compatible with, along with new features that are incompatible with the worker.
@@ -42,6 +44,8 @@ The following table describes the oldest version of the worker that this current
+#### Forward compatibility
+
 ### Supported Spark Versions

From b843cd841f4bbf33da0e4e05505b5567e229d961 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Wed, 6 May 2020 14:38:57 -0700
Subject: [PATCH 16/18] PR comment changes

---
 docs/release-notes/0.11/release-0.11.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 8c1d5354a..4e6a99d2e 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -23,7 +23,7 @@ The following table describes the oldest version of the worker that the current
- Oldest compatible .NET for Apache Spark worker
+ Oldest compatible Microsoft.Spark.Worker version
 Incompatible features

From a8dab1fcc4e0a0c3ccf88ee47bc242c22430ef7c Mon Sep 17 00:00:00 2001
From: Elva Liu
Date: Wed, 6 May 2020 16:17:26 -0700
Subject: [PATCH 17/18] add forward compatibility table

---
 docs/release-notes/0.11/release-0.11.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 4e6a99d2e..78d38f338 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -46,6 +46,20 @@ The following table describes the oldest version of the worker that the current
 #### Forward compatibility
+The following table describes the oldest version of .NET for Apache Spark release that the current version is compatible with.
+
+
+
+
+
+
+
+
+
+
+
+ Oldest compatible .NET for Apache Spark release version
+ v0.9.0
 ### Supported Spark Versions
@@ -84,4 +98,4 @@ The following table outlines the supported Spark versions along with the microso
 Not supported
-
+
\ No newline at end of file

From da5585a955a9fdb244f7080190f74feadeac7dd9 Mon Sep 17 00:00:00 2001
From: elvaliuliuliu <47404285+elvaliuliuliu@users.noreply.github.com>
Date: Wed, 6 May 2020 16:39:39 -0700
Subject: [PATCH 18/18] resolve comments

---
 docs/release-notes/0.11/release-0.11.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/release-notes/0.11/release-0.11.md b/docs/release-notes/0.11/release-0.11.md
index 78d38f338..e5fd3efa9 100644
--- a/docs/release-notes/0.11/release-0.11.md
+++ b/docs/release-notes/0.11/release-0.11.md
@@ -46,7 +46,7 @@
 #### Forward compatibility
-The following table describes the oldest version of .NET for Apache Spark release that the current version is compatible with.
+The following table describes the oldest version of .NET for Apache Spark release that the current worker is compatible with.
@@ -56,7 +56,7 @@ The following table describes the oldest version of .NET for Apache Spark releas
- v0.9.0
+ v0.9.0
@@ -98,4 +98,4 @@ The following table outlines the supported Spark versions along with the microso
 Not supported
-
\ No newline at end of file
+