From 4f8e85a5a54948105400006c263b1bb07e8305b7 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Sun, 19 Oct 2014 23:01:41 +0200 Subject: [PATCH] DOC/REL: clean-up and restructure v0.15.0 whatsnew file (GH8477) --- doc/source/api.rst | 2 + doc/source/whatsnew/v0.10.0.txt | 1 + doc/source/whatsnew/v0.15.0.txt | 906 ++++++++++++++++---------------- 3 files changed, 469 insertions(+), 440 deletions(-) diff --git a/doc/source/api.rst b/doc/source/api.rst index 2e913d8aae4da..f8068ebc38fa9 100644 --- a/doc/source/api.rst +++ b/doc/source/api.rst @@ -190,6 +190,8 @@ Standard moving window functions rolling_quantile rolling_window +.. _api.functions_expanding: + Standard expanding window functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/whatsnew/v0.10.0.txt b/doc/source/whatsnew/v0.10.0.txt index 93ab3b912030d..04159186084f5 100644 --- a/doc/source/whatsnew/v0.10.0.txt +++ b/doc/source/whatsnew/v0.10.0.txt @@ -48,6 +48,7 @@ want to broadcast, we are phasing out this special case (Zen of Python: talking about: .. ipython:: python + :okwarning: import pandas as pd df = pd.DataFrame(np.random.randn(6, 4), diff --git a/doc/source/whatsnew/v0.15.0.txt b/doc/source/whatsnew/v0.15.0.txt index c8c7ed3b5011e..5d7598b749feb 100644 --- a/doc/source/whatsnew/v0.15.0.txt +++ b/doc/source/whatsnew/v0.15.0.txt @@ -17,27 +17,23 @@ users upgrade to this version. 
- The ``Categorical`` type was integrated as a first-class pandas type, see :ref:`here ` - New scalar type ``Timedelta``, and a new index type ``TimedeltaIndex``, see :ref:`here ` - - New DataFrame default display for ``df.info()`` to include memory usage, see :ref:`Memory Usage ` - New datetimelike properties accessor ``.dt`` for Series, see :ref:`Datetimelike Properties ` - - Split indexing documentation into :ref:`Indexing and Selecting Data ` and :ref:`MultiIndex / Advanced Indexing ` - - Split out string methods documentation into :ref:`Working with Text Data ` + - New DataFrame default display for ``df.info()`` to include memory usage, see :ref:`Memory Usage ` - ``read_csv`` will now by default ignore blank lines when parsing, see :ref:`here ` - API change in using Indexes in set operations, see :ref:`here ` + - Enhancements in the handling of timezones, see :ref:`here ` + - Many improvements to the rolling and expanding moment functions, see :ref:`here ` - Internal refactoring of the ``Index`` class to no longer sub-class ``ndarray``, see :ref:`Internal Refactoring ` - dropping support for ``PyTables`` less than version 3.0.0, and ``numexpr`` less than version 2.1 (:issue:`7990`) + - Split indexing documentation into :ref:`Indexing and Selecting Data ` and :ref:`MultiIndex / Advanced Indexing ` + - Split out string methods documentation into :ref:`Working with Text Data ` +- Check the :ref:`API Changes ` and :ref:`deprecations ` before updating + - :ref:`Other Enhancements ` -- :ref:`API Changes ` - -- :ref:`Timezone API Change ` - -- :ref:`Rolling/Expanding Moments API Changes ` - - :ref:`Performance Improvements ` -- :ref:`Deprecations ` - - :ref:`Bug Fixes ` .. warning:: @@ -49,285 +45,169 @@ users upgrade to this version. .. warning:: The refactorings in :class:`~pandas.Categorical` changed the two argument constructor from - "codes/labels and levels" to "values and levels". This can lead to subtle bugs. 
If you use + "codes/labels and levels" to "values and levels (now called 'categories')". This can lead to subtle bugs. If you use :class:`~pandas.Categorical` directly, please audit your code before updating to this pandas version and change it to use the :meth:`~pandas.Categorical.from_codes` constructor. See more on ``Categorical`` :ref:`here ` -.. _whatsnew_0150.api: - -API changes -~~~~~~~~~~~ -- :func:`describe` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the ``include``/``exclude`` arguments. - See the :ref:`docs ` (:issue:`8164`). - - .. ipython:: python - - df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, - 'catB': ['a', 'b', 'c', 'd'] * 6, - 'numC': np.arange(24), - 'numD': np.arange(24.) + .5}) - df.describe(include=["object"]) - df.describe(include=["number", "object"], exclude=["float"]) - - Requesting all columns is possible with the shorthand 'all' - - .. ipython:: python - - df.describe(include='all') - - Without those arguments, 'describe` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the :ref:`docs ` - -- Passing multiple levels to :meth:`~pandas.DataFrame.stack()` will now work when multiple level - numbers are passed (:issue:`7660`), and will raise a ``ValueError`` when the - levels aren't all level names or all level numbers. See - :ref:`Reshaping by stacking and unstacking `. - -- :func:`set_names`, :func:`set_labels`, and :func:`set_levels` methods now take an optional ``level`` keyword argument to all modification of specific level(s) of a MultiIndex. Additionally :func:`set_names` now accepts a scalar string value when operating on an ``Index`` or on a specific level of a ``MultiIndex`` (:issue:`7792`) - - .. 
ipython:: python - - idx = MultiIndex.from_product([['a'], range(3), list("pqr")], names=['foo', 'bar', 'baz']) - idx.set_names('qux', level=0) - idx.set_names(['qux','baz'], level=[0,1]) - idx.set_levels(['a','b','c'], level='bar') - idx.set_levels([['a','b','c'],[1,2,3]], level=[1,2]) - -- Raise a ``ValueError`` in ``df.to_hdf`` with 'fixed' format, if ``df`` has non-unique columns as the resulting file will be broken (:issue:`7761`) - -.. _whatsnew_0150.blanklines: - -- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as - whitespace-filled lines, as long as ``sep`` is not whitespace. This is an API change - that can be controlled by the keyword parameter ``skip_blank_lines``. See :ref:`the docs ` (:issue:`4466`) - -- Bug in passing a ``DatetimeIndex`` with a timezone that was not being retained in DataFrame construction from a dict (:issue:`7822`) - - In prior versions this would drop the timezone. - - .. ipython:: python - - i = date_range('1/1/2011', periods=3, freq='10s', tz = 'US/Eastern') - i - df = DataFrame( {'a' : i } ) - df - df.dtypes - - This behavior is unchanged. - - .. ipython:: python - - df = DataFrame( ) - df['a'] = i - df - df.dtypes - -- ``SettingWithCopy`` raise/warnings (according to the option ``mode.chained_assignment``) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment. (:issue:`7845`, :issue:`7950`) - - .. code-block:: python - - In [1]: df = DataFrame(np.arange(0,9), columns=['count']) - - In [2]: df['group'] = 'b' - - In [3]: df.iloc[0:5]['group'] = 'a' - /usr/local/bin/ipython:1: SettingWithCopyWarning: - A value is trying to be set on a copy of a slice from a DataFrame. 
- Try using .loc[row_indexer,col_indexer] = value instead - - See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy - -- The ``infer_types`` argument to :func:`~pandas.io.html.read_html` now has no - effect (:issue:`7762`, :issue:`7032`). - -- ``DataFrame.to_stata`` and ``StataWriter`` check string length for - compatibility with limitations imposed in dta files where fixed-width - strings must contain 244 or fewer characters. Attempting to write Stata - dta files with strings longer than 244 characters raises a ``ValueError``. (:issue:`7858`) - -- ``read_stata`` and ``StataReader`` can import missing data information into a - ``DataFrame`` by setting the argument ``convert_missing`` to ``True``. When - using this options, missing values are returned as ``StataMissingValue`` - objects and columns containing missing values have ``object`` data type. (:issue:`8045`) - -- ``Index.isin`` now supports a ``level`` argument to specify which index level - to use for membership tests (:issue:`7892`, :issue:`7890`) - - .. code-block:: python - - In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]) - - In [2]: idx.values - Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object) - - In [3]: idx.isin(['a', 'c', 'e'], level=1) - Out[3]: array([ True, False, True, True, False, True], dtype=bool) - -- ``merge``, ``DataFrame.merge``, and ``ordered_merge`` now return the same type - as the ``left`` argument. (:issue:`7737`) -- Histogram from ``DataFrame.plot`` with ``kind='hist'`` (:issue:`7809`), See :ref:`the docs`. -- Boxplot from ``DataFrame.plot`` with ``kind='box'`` (:issue:`7998`), See :ref:`the docs`. -- Consistency when indexing with ``.loc`` and a list-like indexer when no values are found. - - .. ipython:: python +New features +~~~~~~~~~~~~ - df = DataFrame([['a'],['b']],index=[1,2]) - df +.. 
_whatsnew_0150.cat: - In prior versions there was a difference in these two constructs: +Categoricals in Series/DataFrame +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - ``df.loc[[3]]`` would return a frame reindexed by 3 (with all ``np.nan`` values) - - ``df.loc[[3],:]`` would raise ``KeyError``. +:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new +methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`, +:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`, +:issue:`8075`, :issue:`8076`, :issue:`8143`, :issue:`8453`, :issue:`8518`). - Both will now raise a ``KeyError``. The rule is that *at least 1* indexer must be found when using a list-like and ``.loc`` (:issue:`7999`) +For full docs, see the :ref:`categorical introduction ` and the +:ref:`API documentation `. - Furthermore in prior versions these were also different: +.. ipython:: python - - ``df.loc[[1,3]]`` would return a frame reindexed by [1,3] - - ``df.loc[[1,3],:]`` would raise ``KeyError``. + df = DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}) - Both will now return a frame reindex by [1,3]. E.g. + df["grade"] = df["raw_grade"].astype("category") + df["grade"] - .. ipython:: python + # Rename the categories + df["grade"].cat.categories = ["very good", "good", "very bad"] - df.loc[[1,3]] - df.loc[[1,3],:] + # Reorder the categories and simultaneously add the missing categories + df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) + df["grade"] + df.sort("grade") + df.groupby("grade").size() - This can also be seen in multi-axis indexing with a ``Panel``. +- ``pandas.core.group_agg`` and ``pandas.core.factor_agg`` were removed. As an alternative, construct + a dataframe and use ``df.groupby().agg()``. - .. 
ipython:: python +- Supplying "codes/labels and levels" to the :class:`~pandas.Categorical` constructor is not + supported anymore. Supplying two arguments to the constructor is now interpreted as + "values and levels (now called 'categories')". Please change your code to use the :meth:`~pandas.Categorical.from_codes` + constructor. - p = Panel(np.arange(2*3*4).reshape(2,3,4), - items=['ItemA','ItemB'], - major_axis=[1,2,3], - minor_axis=['A','B','C','D']) - p +- The ``Categorical.labels`` attribute was renamed to ``Categorical.codes`` and is read + only. If you want to manipulate codes, please use one of the + :ref:`API methods on Categoricals `. - The following would raise ``KeyError`` prior to 0.15.0: +- The ``Categorical.levels`` attribute is renamed to ``Categorical.categories``. - .. ipython:: python - p.loc[['ItemA','ItemD'],:,'D'] +.. _whatsnew_0150.timedeltaindex: - Furthermore, ``.loc`` will raise If no values are found in a multi-index with a list-like indexer: +TimedeltaIndex/Scalar +^^^^^^^^^^^^^^^^^^^^^ - .. ipython:: python - :okexcept: +We introduce a new scalar type ``Timedelta``, which is a subclass of ``datetime.timedelta``, and behaves in a similar manner, +but allows compatibility with ``np.timedelta64`` types as well as a host of custom representation, parsing, and attributes. +This type is very similar to how ``Timestamp`` works for ``datetimes``. It is a nice-API box for the type. See the :ref:`docs `. +(:issue:`3009`, :issue:`4533`, :issue:`8209`, :issue:`8187`, :issue:`8190`, :issue:`7869`, :issue:`7661`, :issue:`8345`, :issue:`8471`) - s = Series(np.arange(3,dtype='int64'), - index=MultiIndex.from_product([['A'],['foo','bar','baz']], - names=['one','two']) - ).sortlevel() - s - try: - s.loc[['D']] - except KeyError as e: - print("KeyError: " + str(e)) +.. warning:: -- ``Index`` now supports ``duplicated`` and ``drop_duplicates``. 
(:issue:`4060`) + ``Timedelta`` scalars (and ``TimedeltaIndex``) component fields are *not the same* as the component fields on a ``datetime.timedelta`` object. For example, ``.seconds`` on a ``datetime.timedelta`` object returns the total number of seconds combined between ``hours``, ``minutes`` and ``seconds``. In contrast, the pandas ``Timedelta`` breaks out hours, minutes, microseconds and nanoseconds separately. - .. ipython:: python + .. ipython:: python - idx = Index([1, 2, 3, 4, 1, 2]) - idx - idx.duplicated() - idx.drop_duplicates() + # Timedelta accessor + tds = Timedelta('31 days 5 min 3 sec') + tds.minutes + tds.seconds -- Assigning values to ``None`` now considers the dtype when choosing an 'empty' value (:issue:`7941`). + # datetime.timedelta accessor + # this is 5 minutes * 60 + 3 seconds + tds.to_pytimedelta().seconds - Previously, assigning to ``None`` in numeric containers changed the - dtype to object (or errored, depending on the call). It now uses - ``NaN``: +.. warning:: - .. ipython:: python + Prior to 0.15.0 ``pd.to_timedelta`` would return a ``Series`` for list-like/Series input, and a ``np.timedelta64`` for scalar input. + It will now return a ``TimedeltaIndex`` for list-like input, ``Series`` for Series input, and ``Timedelta`` for scalar input. - s = Series([1, 2, 3]) - s.loc[0] = None - s + The arguments to ``pd.to_timedelta`` are now ``(arg,unit='ns',box=True,coerce=False)``, previously were ``(arg,box=True,unit='ns')`` as these are more logical. - ``NaT`` is now used similarly for datetime containers. +Construct a scalar - For object containers, we now preserve ``None`` values (previously these - were converted to ``NaN`` values). +.. ipython:: python - .. 
ipython:: python + Timedelta('1 days 06:05:01.00003') + Timedelta('15.5us') + Timedelta('1 hour 15.5us') - s = Series(["a", "b", "c"]) - s.loc[0] = None - s + # negative Timedeltas have this string repr + # to be more consistent with datetime.timedelta conventions + Timedelta('-1us') - To insert a ``NaN``, you must explicitly use ``np.nan``. See the :ref:`docs `. + # a NaT + Timedelta('nan') -- Previously an enlargement with a mixed-dtype frame would act unlike ``.append`` which will preserve dtypes (related :issue:`2578`, :issue:`8176`): +Access fields for a ``Timedelta`` - .. ipython:: python +.. ipython:: python - df = DataFrame([[True, 1],[False, 2]], - columns=["female","fitness"]) - df - df.dtypes + td = Timedelta('1 hour 3m 15.5us') + td.hours + td.minutes + td.microseconds + td.nanoseconds - # dtypes are now preserved - df.loc[2] = df.loc[1] - df - df.dtypes +Construct a ``TimedeltaIndex`` -- In prior versions, updating a pandas object inplace would not reflect in other python references to this object. (:issue:`8511`,:issue:`5104`) +.. ipython:: python + :suppress: - .. ipython:: python + import datetime + from datetime import timedelta - s = Series([1, 2, 3]) - s2 = s - s += 1.5 +.. ipython:: python - Behavior prior to v0.15.0 + TimedeltaIndex(['1 days','1 days, 00:00:05', + np.timedelta64(2,'D'),timedelta(days=2,seconds=2)]) - .. code-block:: python +Constructing a ``TimedeltaIndex`` with a regular range +.. ipython:: python - # the original object - In [5]: s - Out[5]: - 0 2.5 - 1 3.5 - 2 4.5 - dtype: float64 + timedelta_range('1 days',periods=5,freq='D') + timedelta_range(start='1 days',end='2 days',freq='30T') +You can now use a ``TimedeltaIndex`` as the index of a pandas object - # a reference to the original object - In [7]: s2 - Out[7]: - 0 1 - 1 2 - 2 3 - dtype: int64 +.. ipython:: python - This is now the correct behavior + s = Series(np.arange(5), + index=timedelta_range('1 days',periods=5,freq='s')) + s - .. 
ipython:: python +You can select with partial string selections - # the original object - s +.. ipython:: python - # a reference to the original object - s2 + s['1 day 00:00:02'] + s['1 day':'1 day 00:00:02'] -- ``Series.to_csv()`` now returns a string when ``path=None``, matching the behaviour of ``DataFrame.to_csv()`` (:issue:`8215`). +Finally, the combination of ``TimedeltaIndex`` with ``DatetimeIndex`` allow certain combination operations that are ``NaT`` preserving: -- ``read_hdf`` now raises ``IOError`` when a file that doesn't exist is passed in. Previously, a new, empty file was created, and a ``KeyError`` raised (:issue:`7715`). +.. ipython:: python -- ``DataFrame.info()`` now ends its output with a newline character (:issue:`8114`) -- add ``copy=True`` argument to ``pd.concat`` to enable pass thru of complete blocks (:issue:`8252`) + tdi = TimedeltaIndex(['1 days',pd.NaT,'2 days']) + tdi.tolist() + dti = date_range('20130101',periods=3) + dti.tolist() + + (dti + tdi).tolist() + (dti - tdi).tolist() + +- iteration of a ``Series`` e.g. ``list(Series(...))`` of ``timedelta64[ns]`` would prior to v0.15.0 return ``np.timedelta64`` for each element. These will now be wrapped in ``Timedelta``. -- Added support for numpy 1.8+ data types (``bool_``, ``int_``, ``float_``, ``string_``) for conversion to R dataframe (:issue:`8400`) -- Concatenating no objects will now raise a ``ValueError`` rather than a bare ``Exception``. -- Merge errors will now be sub-classes of ``ValueError`` rather than raw ``Exception`` (:issue:`8501`) -- ``DataFrame.plot`` and ``Series.plot`` keywords are now have consistent orders (:issue:`8037`) .. _whatsnew_0150.memory: Memory Usage -~~~~~~~~~~~~~ +^^^^^^^^^^^^ Implemented methods to find memory usage of a DataFrame. See the :ref:`FAQ ` for more. (:issue:`6852`). @@ -351,10 +231,11 @@ Additionally :meth:`~pandas.DataFrame.memory_usage` is an available method for a df.memory_usage(index=True) + .. 
_whatsnew_0150.dt: .dt accessor -~~~~~~~~~~~~ +^^^^^^^^^^^^ ``Series`` has gained an accessor to succinctly return datetime like properties for the *values* of the Series, if its a datetime/period like Series. (:issue:`7207`) This will return a Series, indexed like the existing Series. See the :ref:`docs ` @@ -408,10 +289,11 @@ The ``.dt`` accessor works for period and timedelta dtypes. s.dt.seconds s.dt.components + .. _whatsnew_0150.tz: -Timezone API changes -~~~~~~~~~~~~~~~~~~~~ +Timezone handling improvements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - ``tz_localize(None)`` for tz-aware ``Timestamp`` and ``DatetimeIndex`` now removes timezone holding local time, previously this resulted in ``Exception`` or ``TypeError`` (:issue:`7812`) @@ -439,14 +321,15 @@ Timezone API changes - ``Timestamp.__repr__`` displays ``dateutil.tz.tzoffset`` info (:issue:`7907`) + .. _whatsnew_0150.roll: -Rolling/Expanding Moments API changes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Rolling/Expanding Moments improvements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - :func:`rolling_min`, :func:`rolling_max`, :func:`rolling_cov`, and :func:`rolling_corr` now return objects with all ``NaN`` when ``len(arg) < min_periods <= window`` rather - than raising. (This makes all rolling functions consistent in this behavior), (:issue:`7766`) + than raising. (This makes all rolling functions consistent in this behavior). 
(:issue:`7766`) Prior to 0.15.0 @@ -520,10 +403,7 @@ Rolling/Expanding Moments API changes rolling_window(s, window=3, win_type='triang', center=True) -- Removed ``center`` argument from :func:`expanding_max`, :func:`expanding_min`, :func:`expanding_sum`, - :func:`expanding_mean`, :func:`expanding_median`, :func:`expanding_std`, :func:`expanding_var`, - :func:`expanding_skew`, :func:`expanding_kurt`, :func:`expanding_quantile`, :func:`expanding_count`, - :func:`expanding_cov`, :func:`expanding_corr`, :func:`expanding_corr_pairwise`, and :func:`expanding_apply`, +- Removed ``center`` argument from all :func:`expanding_ ` functions (see :ref:`list `), as the results produced when ``center=True`` did not make much sense. (:issue:`7925`) - Added optional ``ddof`` argument to :func:`expanding_cov` and :func:`rolling_cov`. @@ -643,178 +523,307 @@ Rolling/Expanding Moments API changes See :ref:`Exponentially weighted moment functions ` for details. (:issue:`7912`) -.. _whatsnew_0150.refactoring: - -Internal Refactoring -~~~~~~~~~~~~~~~~~~~~ -In 0.15.0 ``Index`` has internally been refactored to no longer sub-class ``ndarray`` -but instead subclass ``PandasObject``, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be -a transparent change with only very limited API implications (:issue:`5080`, :issue:`7439`, :issue:`7796`, :issue:`8024`, :issue:`8367`, :issue:`7997`, :issue:`8522`) +.. _whatsnew_0150.sql: -- you may need to unpickle pandas version < 0.15.0 pickles using ``pd.read_pickle`` rather than ``pickle.load``. See :ref:`pickle docs ` -- when plotting with a ``PeriodIndex``. The ``matplotlib`` internal axes will now be arrays of ``Period`` rather than a ``PeriodIndex``. (this is similar to how a ``DatetimeIndex`` passes arrays of ``datetimes`` now) -- MultiIndexes will now raise similary to other pandas objects w.r.t. truth testing, See :ref:`here ` (:issue:`7897`). 
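The all-``NaN``-instead-of-raising rule for the rolling functions above can be sketched in current pandas. Note this uses the modern ``.rolling`` accessor rather than the 0.15-era module-level ``rolling_min``; that spelling is an assumption about the reader's environment, not part of this release:

```python
import pandas as pd

# Only 2 observations, but min_periods=3 <= window=5: in 0.15-era
# pandas this was rolling_min(s, window=5, min_periods=3), which
# now returns an all-NaN result instead of raising.
s = pd.Series([1.0, 2.0])
result = s.rolling(window=5, min_periods=3).min()
print(result.isna().all())  # → True
```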
+Improvements in the sql io module +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -.. _whatsnew_0150.cat: +- Added support for a ``chunksize`` parameter to ``to_sql`` function. This allows DataFrame to be written in chunks and avoid packet-size overflow errors (:issue:`8062`). +- Added support for a ``chunksize`` parameter to ``read_sql`` function. Specifying this argument will return an iterator through chunks of the query result (:issue:`2908`). +- Added support for writing ``datetime.date`` and ``datetime.time`` object columns with ``to_sql`` (:issue:`6932`). +- Added support for specifying a ``schema`` to read from/write to with ``read_sql_table`` and ``to_sql`` (:issue:`7441`, :issue:`7952`). + For example: -Categoricals in Series/DataFrame -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + .. code-block:: python -:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new -methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`, -:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`, -:issue:`8075`, :issue:`8076`, :issue:`8143`, :issue:`8453`, :issue:`8518`). + df.to_sql('table', engine, schema='other_schema') + pd.read_sql_table('table', engine, schema='other_schema') -For full docs, see the :ref:`categorical introduction ` and the -:ref:`API documentation `. +- Added support for writing ``NaN`` values with ``to_sql`` (:issue:`2754`). +- Added support for writing datetime64 columns with ``to_sql`` for all database flavors (:issue:`7103`). -.. ipython:: python - df = DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}) +.. _whatsnew_0150.api: - df["grade"] = df["raw_grade"].astype("category") - df["grade"] +Backwards incompatible API changes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - # Rename the categories - df["grade"].cat.categories = ["very good", "good", "very bad"] +.. 
_whatsnew_0150.api_breaking: - # Reorder the categories and simultaneously add the missing categories - df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) - df["grade"] - df.sort("grade") - df.groupby("grade").size() +Breaking changes +^^^^^^^^^^^^^^^^ -- ``pandas.core.group_agg`` and ``pandas.core.factor_agg`` were removed. As an alternative, construct - a dataframe and use ``df.groupby().agg()``. +API changes related to ``Categorical`` (see :ref:`here ` +for more details): -- Supplying "codes/labels and levels" to the :class:`~pandas.Categorical` constructor is not - supported anymore. Supplying two arguments to the constructor is now interpreted as - "values and levels". Please change your code to use the :meth:`~pandas.Categorical.from_codes` +- The ``Categorical`` constructor with two arguments changed from + "codes/labels and levels" to "values and levels (now called 'categories')". + This can lead to subtle bugs. If you use :class:`~pandas.Categorical` directly, + please audit your code by changing it to use the :meth:`~pandas.Categorical.from_codes` constructor. -- The ``Categorical.labels`` attribute was renamed to ``Categorical.codes`` and is read - only. If you want to manipulate codes, please use one of the - :ref:`API methods on Categoricals `. + An old function call like (prior to 0.15.0): -.. _whatsnew_0150.timedeltaindex: + .. code-block:: python -TimedeltaIndex/Scalar -~~~~~~~~~~~~~~~~~~~~~ + pd.Categorical([0,1,0,2,1], levels=['a', 'b', 'c']) -We introduce a new scalar type ``Timedelta``, which is a subclass of ``datetime.timedelta``, and behaves in a similar manner, -but allows compatibility with ``np.timedelta64`` types as well as a host of custom representation, parsing, and attributes. -This type is very similar to how ``Timestamp`` works for ``datetimes``. It is a nice-API box for the type. See the :ref:`docs `. 
-(:issue:`3009`, :issue:`4533`, :issue:`8209`, :issue:`8187`, :issue:`8190`, :issue:`7869`, :issue:`7661`, :issue:`8345`, :issue:`8471`) + will have to be adapted to the following to keep the same behaviour: -.. warning:: + .. code-block:: python - ``Timedelta`` scalars (and ``TimedeltaIndex``) component fields are *not the same* as the component fields on a ``datetime.timedelta`` object. For example, ``.seconds`` on a ``datetime.timedelta`` object returns the total number of seconds combined between ``hours``, ``minutes`` and ``seconds``. In contrast, the pandas ``Timedelta`` breaks out hours, minutes, microseconds and nanoseconds separately. + In [2]: pd.Categorical.from_codes([0,1,0,2,1], categories=['a', 'b', 'c']) + Out[2]: + [a, b, a, c, b] + Categories (3, object): [a, b, c] - .. ipython:: python +API changes related to the introduction of the ``Timedelta`` scalar (see +:ref:`above ` for more details): + +- Prior to 0.15.0 :func:`to_timedelta` would return a ``Series`` for list-like/Series input, + and a ``np.timedelta64`` for scalar input. It will now return a ``TimedeltaIndex`` for + list-like input, ``Series`` for Series input, and ``Timedelta`` for scalar input. - # Timedelta accessor - tds = Timedelta('31 days 5 min 3 sec') - tds.minutes - tds.seconds +For API changes related to the rolling and expanding functions, see detailed overview :ref:`above `. - # datetime.timedelta accessor - # this is 5 minutes * 60 + 3 seconds - tds.to_pytimedelta().seconds +Other notable API changes: -.. warning:: - Prior to 0.15.0 ``pd.to_timedelta`` would return a ``Series`` for list-like/Series input, and a ``np.timedelta64`` for scalar input. - It will now return a ``TimedeltaIndex`` for list-like input, ``Series`` for Series input, and ``Timedelta`` for scalar input. + .. 
ipython:: python - - The arguments to ``pd.to_timedelta`` are now ``(arg,unit='ns',box=True,coerce=False)``, previously were ``(arg,box=True,unit='ns')`` as these are more logical. + df = DataFrame([['a'],['b']],index=[1,2]) + df -Consruct a scalar + In prior versions there was a difference in these two constructs: -.. ipython:: python + - ``df.loc[[3]]`` would return a frame reindexed by 3 (with all ``np.nan`` values) + - ``df.loc[[3],:]`` would raise ``KeyError``. - Timedelta('1 days 06:05:01.00003') - Timedelta('15.5us') - Timedelta('1 hour 15.5us') + Both will now raise a ``KeyError``. The rule is that *at least 1* indexer must be found when using a list-like and ``.loc`` (:issue:`7999`) - # negative Timedeltas have this string repr - # to be more consistent with datetime.timedelta conventions - Timedelta('-1us') + Furthermore in prior versions these were also different: - # a NaT - Timedelta('nan') + - ``df.loc[[1,3]]`` would return a frame reindexed by [1,3] + - ``df.loc[[1,3],:]`` would raise ``KeyError``. + + Both will now return a frame reindexed by [1,3]. E.g. + + .. ipython:: python + + df.loc[[1,3]] + df.loc[[1,3],:] + + This can also be seen in multi-axis indexing with a ``Panel``. + + .. ipython:: python + + p = Panel(np.arange(2*3*4).reshape(2,3,4), + items=['ItemA','ItemB'], + major_axis=[1,2,3], + minor_axis=['A','B','C','D']) + p + + The following would raise ``KeyError`` prior to 0.15.0: + + .. ipython:: python + + p.loc[['ItemA','ItemD'],:,'D'] + + Furthermore, ``.loc`` will raise if no values are found in a multi-index with a list-like indexer: + + .. ipython:: python + :okexcept: + + s = Series(np.arange(3,dtype='int64'), + index=MultiIndex.from_product([['A'],['foo','bar','baz']], + names=['one','two']) + ).sortlevel() + s + try: + s.loc[['D']] + except KeyError as e: + print("KeyError: " + str(e)) + +- Assigning values to ``None`` now considers the dtype when choosing an 'empty' value (:issue:`7941`). 
+ + Previously, assigning to ``None`` in numeric containers changed the + dtype to object (or errored, depending on the call). It now uses + ``NaN``: + + .. ipython:: python + + s = Series([1, 2, 3]) + s.loc[0] = None + s + + ``NaT`` is now used similarly for datetime containers. + + For object containers, we now preserve ``None`` values (previously these + were converted to ``NaN`` values). + + .. ipython:: python + + s = Series(["a", "b", "c"]) + s.loc[0] = None + s + + To insert a ``NaN``, you must explicitly use ``np.nan``. See the :ref:`docs `. + +- In prior versions, updating a pandas object inplace would not reflect in other python references to this object. (:issue:`8511`, :issue:`5104`) + + .. ipython:: python + + s = Series([1, 2, 3]) + s2 = s + s += 1.5 + + Behavior prior to v0.15.0 + + .. code-block:: python + + + # the original object + In [5]: s + Out[5]: + 0 2.5 + 1 3.5 + 2 4.5 + dtype: float64 + + + # a reference to the original object + In [7]: s2 + Out[7]: + 0 1 + 1 2 + 2 3 + dtype: int64 + + This is now the correct behavior + + .. ipython:: python + + # the original object + s + + # a reference to the original object + s2 + +.. _whatsnew_0150.blanklines: + +- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as + whitespace-filled lines, as long as ``sep`` is not whitespace. This is an API change + that can be controlled by the keyword parameter ``skip_blank_lines``. See :ref:`the docs ` (:issue:`4466`) + +- A timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone + and inserted as ``object`` dtype rather than being converted to a naive ``datetime64[ns]`` (:issue:`8411`). + +- Bug in passing a ``DatetimeIndex`` with a timezone that was not being retained in DataFrame construction from a dict (:issue:`7822`) + + In prior versions this would drop the timezone, now it retains the timezone, + but gives a column of ``object`` dtype: + + .. 
ipython:: python + + i = date_range('1/1/2011', periods=3, freq='10s', tz = 'US/Eastern') + i + df = DataFrame( {'a' : i } ) + df + df.dtypes + + Previously this would have yielded a column of ``datetime64`` dtype, but without timezone info. + + The behaviour of assigning a column to an existing DataFrame as ``df['a'] = i`` + remains unchanged (this already returned an ``object`` column with a timezone). + +- When passing multiple levels to :meth:`~pandas.DataFrame.stack()`, it will now raise a ``ValueError`` when the + levels aren't all level names or all level numbers (:issue:`7660`). See + :ref:`Reshaping by stacking and unstacking `. -Access fields for a ``Timedelta`` +- Raise a ``ValueError`` in ``df.to_hdf`` with 'fixed' format, if ``df`` has non-unique columns as the resulting file will be broken (:issue:`7761`) -.. ipython:: python +- ``SettingWithCopy`` raise/warnings (according to the option ``mode.chained_assignment``) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment. (:issue:`7845`, :issue:`7950`) - td = Timedelta('1 hour 3m 15.5us') - td.hours - td.minutes - td.microseconds - td.nanoseconds + .. code-block:: python -Construct a ``TimedeltaIndex`` + In [1]: df = DataFrame(np.arange(0,9), columns=['count']) -.. ipython:: python - :suppress: + In [2]: df['group'] = 'b' - import datetime - from datetime import timedelta + In [3]: df.iloc[0:5]['group'] = 'a' + /usr/local/bin/ipython:1: SettingWithCopyWarning: + A value is trying to be set on a copy of a slice from a DataFrame. + Try using .loc[row_indexer,col_indexer] = value instead -.. ipython:: python + See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy - TimedeltaIndex(['1 days','1 days, 00:00:05', - np.timedelta64(2,'D'),timedelta(days=2,seconds=2)]) +- ``merge``, ``DataFrame.merge``, and ``ordered_merge`` now return the same type + as the ``left`` argument (:issue:`7737`). 
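The chained-assignment pattern in the ``SettingWithCopy`` bullet above is avoided by collapsing the two indexing steps into a single ``.loc`` call. A minimal sketch, using the modern ``pd.DataFrame`` spelling rather than the bare ``DataFrame`` of the examples above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 9), columns=['count'])
df['group'] = 'b'

# df.iloc[0:5]['group'] = 'a' writes through an intermediate slice and
# triggers SettingWithCopyWarning; a single .loc call indexes the rows
# and the column at once and always writes to df itself.
df.loc[df.index[:5], 'group'] = 'a'
print(df['group'].tolist())  # → ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
```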
-
-Constructing a ``TimedeltaIndex`` with a regular range
+- Previously an enlargement with a mixed-dtype frame would act unlike ``.append``, which preserves dtypes (related :issue:`2578`, :issue:`8176`):

-.. ipython:: python
+  .. ipython:: python

-    timedelta_range('1 days',periods=5,freq='D')
-    timedelta_range(start='1 days',end='2 days',freq='30T')
+     df = DataFrame([[True, 1],[False, 2]],
+                    columns=["female","fitness"])
+     df
+     df.dtypes

-You can now use a ``TimedeltaIndex`` as the index of a pandas object
+     # dtypes are now preserved
+     df.loc[2] = df.loc[1]
+     df
+     df.dtypes

-.. ipython:: python
+- ``Series.to_csv()`` now returns a string when ``path=None``, matching the behaviour of ``DataFrame.to_csv()`` (:issue:`8215`).

-   s = Series(np.arange(5),
-              index=timedelta_range('1 days',periods=5,freq='s'))
-   s
+- ``read_hdf`` now raises ``IOError`` when a file that doesn't exist is passed in. Previously, a new, empty file was created, and a ``KeyError`` raised (:issue:`7715`).

-You can select with partial string selections
+- ``DataFrame.info()`` now ends its output with a newline character (:issue:`8114`)
+- Concatenating no objects will now raise a ``ValueError`` rather than a bare ``Exception``.
+- Merge errors will now be sub-classes of ``ValueError`` rather than raw ``Exception`` (:issue:`8501`)
+- ``DataFrame.plot`` and ``Series.plot`` keywords now have consistent orders (:issue:`8037`)

-.. ipython:: python
-   s['1 day 00:00:02']
-   s['1 day':'1 day 00:00:02']
+.. _whatsnew_0150.refactoring:

-Finally, the combination of ``TimedeltaIndex`` with ``DatetimeIndex`` allow certain combination operations that are ``NaT`` preserving:
+Internal Refactoring
+^^^^^^^^^^^^^^^^^^^^

-.. ipython:: python
+In 0.15.0 ``Index`` has internally been refactored to no longer sub-class ``ndarray``
+but instead subclass ``PandasObject``, similarly to the rest of the pandas objects. This
+change allows very easy sub-classing and creation of new index types. This should be
+a transparent change with only very limited API implications (:issue:`5080`, :issue:`7439`, :issue:`7796`, :issue:`8024`, :issue:`8367`, :issue:`7997`, :issue:`8522`):

-   tdi = TimedeltaIndex(['1 days',pd.NaT,'2 days'])
-   tdi.tolist()
-   dti = date_range('20130101',periods=3)
-   dti.tolist()
+- you may need to unpickle pandas version < 0.15.0 pickles using ``pd.read_pickle`` rather than ``pickle.load``. See :ref:`pickle docs `
+- when plotting with a ``PeriodIndex``, the matplotlib internal axes will now be arrays of ``Period`` rather than a ``PeriodIndex`` (this is similar to how a ``DatetimeIndex`` passes arrays of ``datetimes`` now)
+- MultiIndexes will now raise similarly to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`).
+- When plotting a DatetimeIndex directly with matplotlib's ``plot`` function,
+  the axis labels will no longer be formatted as dates but as integers (the
+  internal representation of a ``datetime64``). To keep the old behaviour you
+  should add a call to the :meth:`~DatetimeIndex.to_pydatetime` method:

-   (dti + tdi).tolist()
-   (dti - tdi).tolist()
+  .. code-block:: python

-- iteration of a ``Series`` e.g. ``list(Series(...))`` of ``timedelta64[ns]`` would prior to v0.15.0 return ``np.timedelta64`` for each element. These will now be wrapped in ``Timedelta``.
+     import matplotlib.pyplot as plt
+     df = pd.DataFrame({'col': np.random.randint(1, 50, 60)},
+                       index=pd.date_range("2012-01-01", periods=60))

-.. _whatsnew_0150.prior_deprecations:
+     # this will now format the x axis labels as integers
+     plt.plot(df.index, df['col'])

-Prior Version Deprecations/Changes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     # to keep the date formatting, convert the index explicitly to python datetime values
+     plt.plot(df.index.to_pydatetime(), df['col'])

-- Remove ``DataFrame.delevel`` method in favor of ``DataFrame.reset_index``

.. _whatsnew_0150.deprecations:

 Deprecations
-~~~~~~~~~~~~
+^^^^^^^^^^^^
+- The ``Categorical`` ``labels`` and ``levels`` attributes are
+  deprecated and renamed to ``codes`` and ``categories``.

 - The ``outtype`` argument to ``pd.DataFrame.to_dict`` has been deprecated in favor of ``orient``. (:issue:`7840`)
 - The ``convert_dummies`` method has been deprecated in favor of ``get_dummies`` (:issue:`8140`)
@@ -826,7 +835,7 @@ Deprecations

 .. _whatsnew_0150.index_set_ops:

-- The ``Index`` set operations ``+`` and ``-`` were deprecated in order to provide these for numeric type operations on certain index types. ``+`` can be replace by ``.union()`` or ``|``, and ``-`` by ``.difference()``. Further the method name ``Index.diff()`` is deprecated and can be replaced by ``Index.difference()`` (:issue:`8226`)
+- The ``Index`` set operations ``+`` and ``-`` were deprecated in order to provide these for numeric type operations on certain index types. ``+`` can be replaced by ``.union()`` or ``|``, and ``-`` by ``.difference()``. Further the method name ``Index.diff()`` is deprecated and can be replaced by ``Index.difference()`` (:issue:`8226`)

 .. code-block:: python

@@ -844,34 +853,83 @@ Deprecations

    # should be replaced by
    Index(['a','b','c']).difference(Index(['b','c','d']))

-.. _whatsnew_0150.enhancements:
+- The ``infer_types`` argument to :func:`~pandas.read_html` now has no
+  effect and is deprecated (:issue:`7762`, :issue:`7032`).

-Enhancements
-~~~~~~~~~~~~
-- Added support for a ``chunksize`` parameter to ``to_sql`` function. This allows DataFrame to be written in chunks and avoid packet-size overflow errors (:issue:`8062`).
-- Added support for a ``chunksize`` parameter to ``read_sql`` function. Specifying this argument will return an iterator through chunks of the query result (:issue:`2908`).
-- Added support for writing ``datetime.date`` and ``datetime.time`` object columns with ``to_sql`` (:issue:`6932`)
-- Added support for specifying a ``schema`` to read from/write to with ``read_sql_table`` and ``to_sql`` (:issue:`7441`, :issue:`7952`).
-  For example:

-  .. code-block:: python
+.. _whatsnew_0150.prior_deprecations:

-     df.to_sql('table', engine, schema='other_schema')
-     pd.read_sql_table('table', engine, schema='other_schema')
+Removal of prior version deprecations/changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-- Added support for writing ``NaN`` values with ``to_sql`` (:issue:`2754`).
-- Added support for writing datetime64 columns with ``to_sql`` for all database flavors (:issue:`7103`).
+- Remove ``DataFrame.delevel`` method in favor of ``DataFrame.reset_index``
+
+
+.. _whatsnew_0150.enhancements:
+
+Enhancements
+~~~~~~~~~~~~
+
+Enhancements in the importing/exporting of Stata files:

 - Added support for bool, uint8, uint16 and uint32 datatypes in ``to_stata`` (:issue:`7097`, :issue:`7365`)
 - Added conversion option when importing Stata files (:issue:`8527`)
+- ``DataFrame.to_stata`` and ``StataWriter`` check string length for
+  compatibility with limitations imposed in dta files where fixed-width
+  strings must contain 244 or fewer characters. Attempting to write Stata
+  dta files with strings longer than 244 characters raises a ``ValueError``. (:issue:`7858`)
+- ``read_stata`` and ``StataReader`` can import missing data information into a
+  ``DataFrame`` by setting the argument ``convert_missing`` to ``True``. When
+  using this option, missing values are returned as ``StataMissingValue``
+  objects and columns containing missing values have ``object`` data type. (:issue:`8045`)
+
+Enhancements in the plotting functions:
+
 - Added ``layout`` keyword to ``DataFrame.plot``. You can pass a tuple of ``(rows, columns)``, one of which can be ``-1`` to automatically infer (:issue:`6667`, :issue:`8071`).
- Allow to pass multiple axes to ``DataFrame.plot``, ``hist`` and ``boxplot`` (:issue:`5353`, :issue:`6970`, :issue:`7069`)
 - Added support for ``c``, ``colormap`` and ``colorbar`` arguments for ``DataFrame.plot`` with ``kind='scatter'`` (:issue:`7780`)
+- Histogram from ``DataFrame.plot`` with ``kind='hist'`` (:issue:`7809`), see :ref:`the docs`.
+- Boxplot from ``DataFrame.plot`` with ``kind='box'`` (:issue:`7998`), see :ref:`the docs`.
+
+Other:
+
 - ``read_csv`` now has a keyword parameter ``float_precision`` which specifies which floating-point converter the C engine should use during parsing, see :ref:`here ` (:issue:`8002`, :issue:`8044`)
+- Added ``searchsorted`` method to ``Series`` objects (:issue:`7447`)
+
+- :func:`describe` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the ``include``/``exclude`` arguments.
+  See the :ref:`docs ` (:issue:`8164`).
+
+  .. ipython:: python
+
+     df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                     'catB': ['a', 'b', 'c', 'd'] * 6,
+                     'numC': np.arange(24),
+                     'numD': np.arange(24.) + .5})
+     df.describe(include=["object"])
+     df.describe(include=["number", "object"], exclude=["float"])
+
+  Requesting all columns is possible with the shorthand ``'all'``:
+
+  .. ipython:: python
+
+     df.describe(include='all')
+
+  Without those arguments, ``describe`` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the :ref:`docs `
+
+- Added ``split`` as an option to the ``orient`` argument in ``pd.DataFrame.to_dict``. (:issue:`7840`)
+
+- The ``get_dummies`` method can now be used on DataFrames. By default only
+  categorical columns are encoded as 0's and 1's, while other columns are
+  left untouched.
+
+  .. ipython:: python
+
+     df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
+                     'C': [1, 2, 3]})
+     pd.get_dummies(df)
+
 - ``PeriodIndex`` supports ``resolution`` as the same as ``DatetimeIndex`` (:issue:`7708`)
 - ``pandas.tseries.holiday`` has added support for additional holidays and ways to observe holidays (:issue:`7070`)
 - ``pandas.tseries.holiday.Holiday`` now supports a list of offsets in Python3 (:issue:`7070`)
@@ -900,20 +958,6 @@ Enhancements

    idx
    idx + pd.offsets.MonthEnd(3)

-- Added ``split`` as an option to the ``orient`` argument in ``pd.DataFrame.to_dict``. (:issue:`7840`)
-
-- The ``get_dummies`` method can now be used on DataFrames. By default only
-  catagorical columns are encoded as 0's and 1's, while other columns are
-  left untouched.
-
-  .. ipython:: python
-
-     df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
-                     'C': [1, 2, 3]})
-     pd.get_dummies(df)
-
-
-
 - Added experimental compatibility with ``openpyxl`` for versions >= 2.0. The ``DataFrame.to_excel``
   method ``engine`` keyword now recognizes ``openpyxl1`` and ``openpyxl2``
   which will explicitly require openpyxl v1 and v2 respectively, failing if
@@ -923,41 +967,64 @@ Enhancements

 - ``DataFrame.fillna`` can now accept a ``DataFrame`` as a fill value (:issue:`8377`)

-- Added ``searchsorted`` method to ``Series`` objects (:issue:`7447`)
-
-.. _whatsnew_0150.performance:
-
-Performance
-~~~~~~~~~~~
+- Passing multiple levels to :meth:`~pandas.DataFrame.stack()` will now work when multiple level
+  numbers are passed (:issue:`7660`). See
+  :ref:`Reshaping by stacking and unstacking `.
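A minimal sketch of the multiple-level ``stack`` call mentioned above (the column labels here are invented for illustration):

```python
import numpy as np
import pandas as pd

# A frame with two column levels
columns = pd.MultiIndex.from_product(
    [["A", "B"], ["cat", "dog"]], names=["exp", "animal"]
)
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

# Stack both column levels at once by passing a list of level numbers;
# stacking every column level yields a Series
stacked = df.stack([0, 1])
print(stacked.shape)
```

Passing a mix of level names and numbers (e.g. ``df.stack(["exp", 1])``) is what now raises the ``ValueError`` described in the API changes.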
-
-- Performance improvements in ``DatetimeIndex.__iter__`` to allow faster iteration (:issue:`7683`)
-- Performance improvements in ``Period`` creation (and ``PeriodIndex`` setitem) (:issue:`5155`)
-- Improvements in Series.transform for significant performance gains (revised) (:issue:`6496`)
-- Performance improvements in ``StataReader`` when reading large files (:issue:`8040`, :issue:`8073`)
-- Performance improvements in ``StataWriter`` when writing large files (:issue:`8079`)
-- Performance and memory usage improvements in multi-key ``groupby`` (:issue:`8128`)
-- Performance improvements in groupby ``.agg`` and ``.apply`` where builtins max/min were not mapped to numpy/cythonized versions (:issue:`7722`)
-- Performance improvement in writing to sql (``to_sql``) of up to 50% (:issue:`8208`).
-- Performance benchmarking of groupby for large value of ngroups (:issue:`6787`)
-- Performance improvement in ``CustomBusinessDay``, ``CustomBusinessMonth`` (:issue:`8236`)
-- Performance improvement for ``MultiIndex.values`` for multi-level indexes containing datetimes (:issue:`8543`)
+- :func:`set_names`, :func:`set_labels`, and :func:`set_levels` methods now take an optional ``level`` keyword argument to allow modification of specific level(s) of a MultiIndex. Additionally :func:`set_names` now accepts a scalar string value when operating on an ``Index`` or on a specific level of a ``MultiIndex`` (:issue:`7792`)
+
+  .. ipython:: python
+
+     idx = MultiIndex.from_product([['a'], range(3), list("pqr")], names=['foo', 'bar', 'baz'])
+     idx.set_names('qux', level=0)
+     idx.set_names(['qux','baz'], level=[0,1])
+     idx.set_levels(['a','b','c'], level='bar')
+     idx.set_levels([['a','b','c'],[1,2,3]], level=[1,2])
+
+- ``Index.isin`` now supports a ``level`` argument to specify which index level
+  to use for membership tests (:issue:`7892`, :issue:`7890`)
+
+  .. code-block:: python
+
+     In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])
+
+     In [2]: idx.values
+     Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object)
+
+     In [3]: idx.isin(['a', 'c', 'e'], level=1)
+     Out[3]: array([ True, False,  True,  True, False,  True], dtype=bool)
+
+- ``Index`` now supports ``duplicated`` and ``drop_duplicates``. (:issue:`4060`)
+
+  .. ipython:: python
+
+     idx = Index([1, 2, 3, 4, 1, 2])
+     idx
+     idx.duplicated()
+     idx.drop_duplicates()
+
+- add ``copy=True`` argument to ``pd.concat`` to enable pass through of complete blocks (:issue:`8252`)
+- Added support for numpy 1.8+ data types (``bool_``, ``int_``, ``float_``, ``string_``) for conversion to R dataframe (:issue:`8400`)
+
+.. _whatsnew_0150.performance:
+
+Performance
+~~~~~~~~~~~
+
+- Performance improvements in ``DatetimeIndex.__iter__`` to allow faster iteration (:issue:`7683`)
+- Performance improvements in ``Period`` creation (and ``PeriodIndex`` setitem) (:issue:`5155`)
+- Improvements in ``Series.transform`` for significant performance gains (revised) (:issue:`6496`)
+- Performance improvements in ``StataReader`` when reading large files (:issue:`8040`, :issue:`8073`)
+- Performance improvements in ``StataWriter`` when writing large files (:issue:`8079`)
+- Performance and memory usage improvements in multi-key ``groupby`` (:issue:`8128`)
+- Performance improvements in groupby ``.agg`` and ``.apply`` where builtins max/min were not mapped to numpy/cythonized versions (:issue:`7722`)
+- Performance improvement in writing to sql (``to_sql``) of up to 50% (:issue:`8208`).
+- Performance benchmarking of groupby for large value of ngroups (:issue:`6787`) +- Performance improvement in ``CustomBusinessDay``, ``CustomBusinessMonth`` (:issue:`8236`) +- Performance improvement for ``MultiIndex.values`` for multi-level indexes containing datetimes (:issue:`8543`) @@ -965,6 +1032,7 @@ Performance Bug Fixes ~~~~~~~~~ + - Bug in pivot_table, when using margins and a dict aggfunc (:issue:`8349`) - Bug in ``read_csv`` where ``squeeze=True`` would return a view (:issue:`8217`) - Bug in checking of table name in ``read_sql`` in certain cases (:issue:`7826`). @@ -1002,44 +1070,26 @@ Bug Fixes - Bug in ``PeriodIndex.unique`` returns int64 ``np.ndarray`` (:issue:`7540`) - Bug in ``groupby.apply`` with a non-affecting mutation in the function (:issue:`8467`) - Bug in ``DataFrame.reset_index`` which has ``MultiIndex`` contains ``PeriodIndex`` or ``DatetimeIndex`` with tz raises ``ValueError`` (:issue:`7746`, :issue:`7793`) - - - Bug in ``DataFrame.plot`` with ``subplots=True`` may draw unnecessary minor xticks and yticks (:issue:`7801`) - Bug in ``StataReader`` which did not read variable labels in 117 files due to difference between Stata documentation and implementation (:issue:`7816`) - Bug in ``StataReader`` where strings were always converted to 244 characters-fixed width irrespective of underlying string size (:issue:`7858`) - - Bug in ``DataFrame.plot`` and ``Series.plot`` may ignore ``rot`` and ``fontsize`` keywords (:issue:`7844`) - - - Bug in ``DatetimeIndex.value_counts`` doesn't preserve tz (:issue:`7735`) - Bug in ``PeriodIndex.value_counts`` results in ``Int64Index`` (:issue:`7735`) - Bug in ``DataFrame.join`` when doing left join on index and there are multiple matches (:issue:`5391`) - - - - Bug in ``GroupBy.transform()`` where int groups with a transform that didn't preserve the index were incorrectly truncated (:issue:`7972`). 
- - Bug in ``groupby`` where callable objects without name attributes would take the wrong path, and produce a ``DataFrame`` instead of a ``Series`` (:issue:`7929`) - - Bug in ``groupby`` error message when a DataFrame grouping column is duplicated (:issue:`7511`) - - Bug in ``read_html`` where the ``infer_types`` argument forced coercion of date-likes incorrectly (:issue:`7762`, :issue:`7032`). - - - Bug in ``Series.str.cat`` with an index which was filtered as to not include the first item (:issue:`7857`) - - - Bug in ``Timestamp`` cannot parse ``nanosecond`` from string (:issue:`7878`) - Bug in ``Timestamp`` with string offset and ``tz`` results incorrect (:issue:`7833`) - - Bug in ``tslib.tz_convert`` and ``tslib.tz_convert_single`` may return different results (:issue:`7798`) - Bug in ``DatetimeIndex.intersection`` of non-overlapping timestamps with tz raises ``IndexError`` (:issue:`7880`) - Bug in alignment with TimeOps and non-unique indexes (:issue:`8363`) - - - Bug in ``GroupBy.filter()`` where fast path vs. slow path made the filter return a non scalar value that appeared valid but wasn't (:issue:`7870`). - Bug in ``date_range()``/``DatetimeIndex()`` when the timezone was inferred from input dates yet incorrect @@ -1048,46 +1098,23 @@ Bug Fixes - Bug in area plot draws legend with incorrect ``alpha`` when ``stacked=True`` (:issue:`8027`) - ``Period`` and ``PeriodIndex`` addition/subtraction with ``np.timedelta64`` results in incorrect internal representations (:issue:`7740`) - Bug in ``Holiday`` with no offset or observance (:issue:`7987`) - - Bug in ``DataFrame.to_latex`` formatting when columns or index is a ``MultiIndex`` (:issue:`7982`). - - Bug in ``DateOffset`` around Daylight Savings Time produces unexpected results (:issue:`5175`). 
- - - - - - Bug in ``DataFrame.shift`` where empty columns would throw ``ZeroDivisionError`` on numpy 1.7 (:issue:`8019`) - - - - - - Bug in installation where ``html_encoding/*.html`` wasn't installed and therefore some tests were not running correctly (:issue:`7927`). - - Bug in ``read_html`` where ``bytes`` objects were not tested for in ``_read`` (:issue:`7927`). - - Bug in ``DataFrame.stack()`` when one of the column levels was a datelike (:issue:`8039`) - Bug in broadcasting numpy scalars with ``DataFrame`` (:issue:`8116`) - - - Bug in ``pivot_table`` performed with nameless ``index`` and ``columns`` raises ``KeyError`` (:issue:`8103`) - - Bug in ``DataFrame.plot(kind='scatter')`` draws points and errorbars with different colors when the color is specified by ``c`` keyword (:issue:`8081`) - - - - - Bug in ``Float64Index`` where ``iat`` and ``at`` were not testing and were failing (:issue:`8092`). - Bug in ``DataFrame.boxplot()`` where y-limits were not set correctly when producing multiple axes (:issue:`7528`, :issue:`5517`). - - Bug in ``read_csv`` where line comments were not handled correctly given a custom line terminator or ``delim_whitespace=True`` (:issue:`8122`). - - Bug in ``read_html`` where empty tables caused a ``StopIteration`` (:issue:`7575`) - Bug in casting when setting a column in a same-dtype block (:issue:`7704`) - Bug in accessing groups from a ``GroupBy`` when the original grouper @@ -1097,7 +1124,6 @@ Bug Fixes - Bug in ``GroupBy.count`` with float32 data type were nan values were not excluded (:issue:`8169`). - Bug with stacked barplots and NaNs (:issue:`8175`). - Bug in resample with non evenly divisible offsets (e.g. '7s') (:issue:`8371`) - - Bug in interpolation methods with the ``limit`` keyword when no values needed interpolating (:issue:`7173`). - Bug where ``col_space`` was ignored in ``DataFrame.to_string()`` when ``header=False`` (:issue:`8230`). 
- Bug with ``DatetimeIndex.asof`` incorrectly matching partial strings and returning the wrong date (:issue:`8245`). @@ -1121,6 +1147,6 @@ Bug Fixes - Bug in ``Series`` that allows it to be indexed by a ``DataFrame`` which has unexpected results. Such indexing is no longer permitted (:issue:`8444`) - Bug in item assignment of a ``DataFrame`` with multi-index columns where right-hand-side columns were not aligned (:issue:`7655`) - Suppress FutureWarning generated by NumPy when comparing object arrays containing NaN for equality (:issue:`7065`) - - Bug in ``DataFrame.eval()`` where the dtype of the ``not`` operator (``~``) was not correctly inferred as ``bool``. +