Version 0.33.0
apply and transform Improvements
We added support for passing positional and keyword arguments to apply, apply_batch, transform, and transform_batch in DataFrame, Series, and GroupBy. (#1484, #1485, #1486)
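The examples in this section use two calling conventions: apply takes the extra positional arguments as an args tuple, while transform_batch and GroupBy.apply take them directly after the function. The sketch below illustrates that difference in plain Python. The helpers apply_each and transform_whole are hypothetical stand-ins, not Koalas code, and they work element-wise on a list for brevity rather than on real columns.

```python
# Hypothetical stand-ins, not Koalas code: they only mimic how the extra
# arguments in the release-note examples reach the user function.

def apply_each(values, func, args=(), **kwds):
    """Call func on every element; extras arrive as an args tuple plus keywords."""
    return [func(v, *args, **kwds) for v in values]


def transform_whole(values, func, *args, **kwds):
    """Call func once on the whole batch; extras are passed directly."""
    return func(list(values), *args, **kwds)


ids = range(10)

# Same argument layout as:
#   ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
applied = apply_each(ids, lambda a, b, c: a + b + c, args=(1,), c=3)

# Same argument layout as:
#   ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
transformed = transform_whole(
    ids, lambda batch, a, b, c: [v + a + b + c for v in batch], 1, 2, c=3
)

print(applied)      # values 4 through 13
print(transformed)  # values 6 through 15
```

In both cases the first parameter of the user function receives the data, and the remaining parameters are filled from the extra positional and keyword arguments.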
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
id
0 4
1 5
2 6
3 7
4 8
5 9
6 10
7 11
8 12
9 13

>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0 6
1 7
2 8
3 9
4 10
5 11
6 12
7 13
8 14
9 15
Name: id, dtype: int64

>>> kdf = ks.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
a b c
0 5 5 5
1 7 5 11
2 9 7 21
3 11 9 35
4 13 13 53
5 15 19 75
Spark Schema
We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)
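As a rough guide to reading the output in the example for this section, the sketch below rebuilds the same struct string from column names and pandas dtype names. The lookup table and the simple_string helper are hypothetical: the table is read off the example output only and is not Koalas' internal type mapping.

```python
# Hypothetical lookup table, read off the example output; it is not
# Koalas' internal type mapping and covers only these six dtypes.
DTYPE_TO_SPARK = {
    "object": "string",
    "int64": "bigint",
    "int8": "tinyint",
    "float64": "double",
    "bool": "boolean",
    "datetime64[ns]": "timestamp",
}


def simple_string(columns, index_col=None):
    """Build a struct<...> string from (name, dtype name) pairs."""
    fields = []
    if index_col is not None:
        # In the example the index appears as a leading bigint column.
        fields.append((index_col, "bigint"))
    fields.extend((name, DTYPE_TO_SPARK[dtype]) for name, dtype in columns)
    return "struct<" + ",".join(f"{n}:{t}" for n, t in fields) + ">"


# Assumed dtypes for the columns a..f of the example DataFrame.
columns = [
    ("a", "object"), ("b", "int64"), ("c", "int8"),
    ("d", "float64"), ("e", "bool"), ("f", "datetime64[ns]"),
]
schema = simple_string(columns)
schema_with_index = simple_string(columns, index_col="index")
print(schema)
print(schema_with_index)
```

Passing index_col only prepends one more field, which matches how the index column shows up first in the example output.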
>>> kdf = ks.DataFrame({'a': list('abc'),
... 'b': list(range(1, 4)),
... 'c': np.arange(3, 6).astype('i1'),
... 'd': np.arange(4.0, 7.0, dtype='float64'),
... 'e': [True, False, True],
... 'f': pd.date_range('20130101', periods=3)},
... columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> # Print the schema as a Spark DDL-formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> # Print the schema in the same format as DataFrame.printSchema()
>>> kdf.print_schema()
root
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
>>> kdf.print_schema(index_col='index')
root
|-- index: long (nullable = false)
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
GroupBy Improvements
We fixed many GroupBy bugs, as listed below.
- Fix groupby when as_index=False. (#1457)
- Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)
- Fix Series.groupby on the Series from different DataFrames. (#1460)
- Fix GroupBy.head to recognize agg_columns. (#1474)
- Fix GroupBy.filter to follow complex group keys. (#1471)
- Fix GroupBy.transform to follow complex group keys. (#1472)
- Fix GroupBy.apply to follow complex group keys. (#1473)
- Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)
- Fix GroupBy.filter and apply to handle agg_columns. (#1480)
- Fix GroupBy apply, filter, and head to ignore temp columns when operating on different DataFrames. (#1488)
- Fix GroupBy functions which need natural orderings to follow the order when operating on different DataFrames. (#1490)
Other new features and improvements
We added the following new feature:
SeriesGroupBy:
- filter (#1483)
Other improvements
- dtype for DateType should be np.dtype("object"). (#1447)
- Make reset_index disallow the same name but allow it when drop=True. (#1455)
- Fix named aggregation for MultiIndex (#1435)
- Raise ValueError that is not raised now (#1461)
- Fix get_dummies when the prefix parameter is a dict (#1478)
- Simplify DataFrame.columns setter. (#1489)