DOC: expanding comparison with R section #12472

Closed
wants to merge 4 commits into from
73 changes: 73 additions & 0 deletions doc/source/comparison_with_r.rst
@@ -31,6 +31,79 @@ For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
use HDF5 files, see :ref:`io.external_compatibility` for an
example.


Quick Reference
---------------

We'll start off with a quick reference guide pairing some common R
operations using `dplyr
<http://cran.r-project.org/web/packages/dplyr/index.html>`__ with
pandas equivalents.


Querying, Filtering, Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``dim(df)``                                 ``df.shape``
``head(df)``                                ``df.head()``
``slice(df, 1:10)``                         ``df.iloc[:10]``
``filter(df, col1 == 1, col2 == 1)``        ``df.query('col1 == 1 & col2 == 1')``
``df[df$col1 == 1 & df$col2 == 1,]``        ``df[(df.col1 == 1) & (df.col2 == 1)]``
``select(df, col1, col2)``                  ``df[['col1', 'col2']]``
``select(df, col1:col3)``                   ``df.loc[:, 'col1':'col3']``
``select(df, -(col1:col3))``                ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
``distinct(select(df, col1))``              ``df[['col1']].drop_duplicates()``
Contributor

in R does this return a different shape (e.g. Series/DataFrame distinction) if you provide 1 vs multiple columns?

Contributor Author

I don't know, let me see if I can reproduce.

Contributor Author

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
> distinct(select(mtcars, gear))
  gear
1    4
2    3
3    5
> distinct(select(mtcars, gear, carb))
   gear carb
1     4    4
2     4    1
3     3    1
4     3    2
5     3    4
6     4    2
7     3    3
8     5    2
9     5    4
10    5    6
11    5    8

Contributor Author

@jreback I think it's the same type either way

Contributor

hmm that's interesting. ok best then to show the frame result then (which i think is what you did) (even for 1 column)

Contributor Author

This is the simplest way I could find to select a single series in R:

> distinct(select(mtcars, gear))$gear
[1] 4 3 5

``distinct(select(df, col1, col2))``        ``df[['col1', 'col2']].drop_duplicates()``
``sample_n(df, 10)``                        ``df.sample(n=10)``
``sample_frac(df, 0.01)``                   ``df.sample(frac=0.01)``
=========================================== ===========================================
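
As a rough illustration of the filtering and ``distinct`` rows, the sketch below runs them on a small made-up frame. The data is hypothetical; only the column names follow the table. It also shows that double brackets keep a one-column ``DataFrame``, while single brackets give a ``Series``, which is the closest match to ``distinct(select(df, col1))$col1`` in R.

```python
import pandas as pd

# Hypothetical data; only the column names come from the table.
df = pd.DataFrame({
    "col1": [1, 1, 2, 1],
    "col2": [1, 2, 1, 1],
    "col3": [10, 20, 30, 40],
})

# filter(df, col1 == 1, col2 == 1): a query string or a boolean mask.
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df.col1 == 1) & (df.col2 == 1)]

# distinct(select(df, col1)): double brackets keep a one-column DataFrame.
distinct_frame = df[["col1"]].drop_duplicates()

# Single brackets select a Series instead of a DataFrame.
distinct_series = df["col1"].drop_duplicates()
```

Both filtering spellings select the same rows here, so the choice between them is mostly one of readability.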

.. [#select_range] R's shorthand for a range of columns
                   (``select(df, col1:col3)``) translates cleanly to pandas
                   when you have the list of column labels, for example
                   ``df[cols[1:3]]`` or ``df.drop(cols[1:3], axis=1)``.
                   Dropping a range by column name takes an extra step,
                   because ``drop`` expects explicit labels rather than a
                   slice.
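
One possible way to drop a range of columns by name is to take the labels from a ``.loc`` slice first and pass them to ``drop``. This is a sketch under assumptions: the frame and the name ``cols_to_drop`` are illustrative, and other approaches may suit a given dataset better.

```python
import pandas as pd

# Hypothetical one-row frame with five columns.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["col0", "col1", "col2", "col3", "col4"],
)

# select(df, col1:col3): label slices with .loc include both endpoints.
kept = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): collect the labels from the same slice,
# then drop them along the column axis.
cols_to_drop = df.loc[:, "col1":"col3"].columns
dropped = df.drop(cols_to_drop, axis=1)
```

Unlike positional slices, the label slice includes ``col3``, which matches how ``col1:col3`` behaves in ``select``.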


Sorting
~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``arrange(df, col1, col2)``                 ``df.sort_values(['col1', 'col2'])``
``arrange(df, desc(col1))``                 ``df.sort_values('col1', ascending=False)``
=========================================== ===========================================
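
The sorting rows can be tried on a small hypothetical frame as sketched below. The last line is not in the table: it assumes you want the pandas counterpart of ``arrange(df, desc(col1), col2)``, and uses a list of flags for ``ascending``, one per sort key.

```python
import pandas as pd

# Hypothetical data; only the column names come from the table.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [1, 4, 0, 3]})

# arrange(df, col1, col2): sort by col1, then by col2 within ties.
ascending = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1)): a single key in descending order.
descending = df.sort_values("col1", ascending=False)

# Mixed directions: one ascending flag per sort key.
mixed = df.sort_values(["col1", "col2"], ascending=[False, True])
```

``sort_values`` returns a new frame and keeps the original row labels, so the index shows where each row came from.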

Transforming
~~~~~~~~~~~~

=========================================== ===========================================
R                                           pandas
=========================================== ===========================================
``select(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})[['col_one']]``
``rename(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})``
``mutate(df, c=a-b)``                       ``df.assign(c=df.a-df.b)``
=========================================== ===========================================
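
A short sketch of the transforming rows follows, again on made-up data. The callable form of ``assign`` is an addition for illustration: it computes the new column from the frame passed in, which helps when the frame has no name yet, such as in the middle of a method chain.

```python
import pandas as pd

# Hypothetical data; the column names follow the table.
df = pd.DataFrame({"a": [5, 7], "b": [2, 3], "col1": [10, 20]})

# rename(df, col_one = col1): returns a new frame, df is unchanged.
renamed = df.rename(columns={"col1": "col_one"})

# mutate(df, c=a-b): assign adds the column to a copy of df.
with_c = df.assign(c=df.a - df.b)

# Same result with a callable that receives the frame being assigned to.
with_c_lazy = df.assign(c=lambda d: d.a - d.b)
```

Neither ``rename`` nor ``assign`` modifies ``df`` in place, which mirrors how the dplyr verbs return a new data frame.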


Grouping and Summarizing
~~~~~~~~~~~~~~~~~~~~~~~~

============================================== ===========================================
R                                              pandas
============================================== ===========================================
``summary(df)``                                ``df.describe()``
``gdf <- group_by(df, col1)``                  ``gdf = df.groupby('col1')``
``summarise(gdf, avg=mean(col1, na.rm=TRUE))`` ``df.groupby('col1').agg({'col1': 'mean'})``
``summarise(gdf, total=sum(col1))``            ``df.groupby('col1').sum()``
============================================== ===========================================
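
The sketch below illustrates the grouping rows on hypothetical data. As an assumption made for clarity, it aggregates a separate value column, ``col2``, rather than the grouping column used in the table. It also shows that the pandas ``mean`` and ``sum`` skip missing values by default, which is why no counterpart to ``na.rm=TRUE`` appears on the pandas side.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the column being summarised.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "col2": [1.0, np.nan, 3.0, 5.0],
})

# gdf <- group_by(df, col1)
gdf = df.groupby("col1")

# summarise(gdf, avg=mean(col2, na.rm=TRUE)): NaN is skipped by default.
avg = gdf.agg(avg=("col2", "mean"))

# summarise(gdf, total=sum(col2)): select the column, then aggregate.
total = gdf["col2"].sum()
```

The named form ``agg(avg=("col2", "mean"))`` sets the output column name, much like ``avg=`` does in ``summarise``.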


Base R
------
