diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 0841f3354d160..fad3d034c8d17 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -31,6 +31,79 @@
 For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
 use HDF5 files, see :ref:`io.external_compatibility` for an example.
 
+
+Quick Reference
+---------------
+
+We'll start off with a quick reference guide pairing some common R
+operations using `dplyr
+`__ with
+pandas equivalents.
+
+
+Querying, Filtering, Sampling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+=========================================== ===========================================
+R                                           pandas
+=========================================== ===========================================
+``dim(df)``                                 ``df.shape``
+``head(df)``                                ``df.head()``
+``slice(df, 1:10)``                         ``df.iloc[:10]``
+``filter(df, col1 == 1, col2 == 1)``        ``df.query('col1 == 1 & col2 == 1')``
+``df[df$col1 == 1 & df$col2 == 1,]``        ``df[(df.col1 == 1) & (df.col2 == 1)]``
+``select(df, col1, col2)``                  ``df[['col1', 'col2']]``
+``select(df, col1:col3)``                   ``df.loc[:, 'col1':'col3']``
+``select(df, -(col1:col3))``                ``df.drop(cols_to_drop, axis=1)`` but see [#select_range]_
+``distinct(select(df, col1))``              ``df[['col1']].drop_duplicates()``
+``distinct(select(df, col1, col2))``        ``df[['col1', 'col2']].drop_duplicates()``
+``sample_n(df, 10)``                        ``df.sample(n=10)``
+``sample_frac(df, 0.01)``                   ``df.sample(frac=0.01)``
+=========================================== ===========================================
+
+.. [#select_range] R's shorthand for a subrange of columns
+    (``select(df, col1:col3)``) can be approached
+    cleanly in pandas if you have the list of columns,
+    for example ``df[cols[1:3]]`` or
+    ``df.drop(cols[1:3], axis=1)``, but doing this by column
+    name is a bit messy.
+
+
+Sorting
+~~~~~~~
+
+=========================================== ===========================================
+R                                           pandas
+=========================================== ===========================================
+``arrange(df, col1, col2)``                 ``df.sort_values(['col1', 'col2'])``
+``arrange(df, desc(col1))``                 ``df.sort_values('col1', ascending=False)``
+=========================================== ===========================================
+
+Transforming
+~~~~~~~~~~~~
+
+=========================================== ===========================================
+R                                           pandas
+=========================================== ===========================================
+``select(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})['col_one']``
+``rename(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})``
+``mutate(df, c=a-b)``                       ``df.assign(c=df.a-df.b)``
+=========================================== ===========================================
+
+
+Grouping and Summarizing
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+============================================== ===========================================
+R                                              pandas
+============================================== ===========================================
+``summary(df)``                                ``df.describe()``
+``gdf <- group_by(df, col1)``                  ``gdf = df.groupby('col1')``
+``summarise(gdf, avg=mean(col1, na.rm=TRUE))`` ``df.groupby('col1').agg({'col1': 'mean'})``
+``summarise(gdf, total=sum(col1))``            ``df.groupby('col1').sum()``
+============================================== ===========================================
+
+
 Base R
 ------
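
A few rows of the querying table can be checked with a small script. This is only a sketch: the frame and its values are made up for illustration, and the column names simply mirror the placeholders in the tables. It also shows one possible way to drop a range of columns by name, which the footnote describes as a bit messy; it is not necessarily the approach the patch author has in mind.

```python
import pandas as pd

# Hypothetical frame; column names mirror the placeholders used in the tables.
df = pd.DataFrame({
    "col0": [0, 1, 2, 3],
    "col1": [1, 1, 2, 1],
    "col2": [1, 2, 1, 1],
    "col3": [5, 6, 7, 8],
    "col4": [9, 9, 9, 9],
})

# filter(df, col1 == 1, col2 == 1): the query string and the boolean mask
# select the same rows.
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df.col1 == 1) & (df.col2 == 1)]

# select(df, col1:col3): label slices with .loc include both endpoints.
kept = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): resolve the labels with the same slice, then
# hand them to drop(). One option among several.
dropped = df.drop(columns=df.loc[:, "col1":"col3"].columns)

print(by_query.equals(by_mask))
print(list(kept.columns))
print(list(dropped.columns))
```

Unlike positional slices, the label slice in `.loc` is inclusive at both ends, so `'col1':'col3'` covers the same three columns as R's `col1:col3`.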