Implicit alignment in operations #12

Open
TomAugspurger opened this issue Jun 5, 2020 · 10 comments

@TomAugspurger

In #2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.

In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])

In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])

In [12]: a
Out[12]:
   A
a  1
b  2
c  3

In [13]: b
Out[13]:
   A
b  2
c  3
a  1

In [14]: a + b
Out[14]:
   A
a  2
b  4
c  6

In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
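To make the implicit step visible, here is a minimal sketch that calls `DataFrame.align` by hand on the same two frames. It relies on `align` defaulting to an outer join, so both outputs share the union index.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1]}, index=["b", "c", "a"])

# align() defaults to join="outer": both results are reindexed to the
# union of the two indexes, in the same order.
a_aligned, b_aligned = a.align(b)

# Adding the already-aligned frames gives the same result as a + b.
explicit = a_aligned + b_aligned
implicit = a + b
```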

A few other places this occurs

  • Indexing a DataFrame / Series with an integer or boolean series
  • pd.concat
  • DataFrame constructor
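A small sketch of the three cases listed above, using made-up data. In each case pandas matches rows by label rather than by position.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Boolean-Series indexing: the mask is matched to `s` by label.
# By position this mask would select "b" and "c"; by label it selects "a" and "b".
mask = pd.Series([False, True, True], index=["c", "b", "a"])
selected = s[mask]

left = pd.Series([1, 2], index=["x", "y"], name="left")
right = pd.Series([20, 10], index=["y", "x"], name="right")

# pd.concat along columns lines the inputs up on their row labels.
wide = pd.concat([left, right], axis=1)

# The DataFrame constructor does the same for a dict of Series.
built = pd.DataFrame({"left": left, "right": right})
```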

Do we want to adopt this behavior for the standard?

@rgommers
Member

rgommers commented Jun 5, 2020

> In #2 there seems to be some agreement that row-labels are an important component of a dataframe.

Eh, just to make sure, can you summarize that agreement? As far as I can see you suggested that including row labels was inappropriate, and @devin-petersohn was in favor but also noted that columnar dataframes like Vaex and R tidyverse do not support row labels. Hence my impression was that row labels should be optional or a "level 1" feature.

@devin-petersohn
Member

@rgommers I do not believe that the presence of row labels affects this conversation directly because a + b could be reasonably done with the position instead of row labels, and in that case the position is the row's label. Side note: let's figure out this "levels" thing. It is hard to have meaningful conversations without the concrete levels.

@TomAugspurger This brings up an interesting discussion about joining/manipulating the row labels (or order) and how that interacts with the data in a dataframe.

If we dissect the a + b operation, we are effectively doing a join along both axes, and applying addition (or another binary operation) wherever labels collide on both axes. This is a bit unusual from the database perspective, but it can be done (though it is tedious).
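One way to read that decomposition, as a rough sketch in pandas: take the union of the labels on each axis, reindex both operands to it, then add purely by position. The frames and the names `rows`, `cols` and `manual` are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1], "B": [7, 8, 9]}, index=["b", "c", "d"])

# "Join" step: outer join of the labels on both axes.
rows = a.index.union(b.index)
cols = a.columns.union(b.columns)

# "Operate" step: after reindexing, the add is positional.
# Cells present in only one operand become NaN.
manual = pd.DataFrame(
    a.reindex(index=rows, columns=cols).to_numpy()
    + b.reindex(index=rows, columns=cols).to_numpy(),
    index=rows,
    columns=cols,
)

implicit = a + b
```

Only the cells whose row and column labels appear in both frames (`b` and `c` in column `A` here) end up with a value.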

So there is this ability to treat labels as data that can be joined on, or to manipulate the order of the rows with an align-style join. It's a very nice property for visualization.

@TomAugspurger
Author

Re-reading #2, it does seem that I overstated the level of support for row labels.

As Devin notes, there's a positional version of label alignment:

In [9]: a = pd.Series([1.0, 2.0])

In [10]: b = pd.Series([1.0, 2.0, 3.0])

In [11]: a
Out[11]:
0    1.0
1    2.0
dtype: float64

In [12]: b
Out[12]:
0    1.0
1    2.0
2    3.0
dtype: float64

In [13]: a + b
Out[13]:
0    2.0
1    4.0
2    NaN
dtype: float64

So do we expect that operation to raise (different shapes) or align (by position)? My recommendation would be to align.
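The two options can be put side by side in a short sketch. `strict_add` is a hypothetical helper written for this illustration, not an existing pandas function. It shows what the "raise" choice might look like next to pandas' current "align" behaviour.

```python
import pandas as pd


def strict_add(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: add two Series only if their indexes are identical."""
    if not left.index.equals(right.index):
        raise ValueError("indexes differ; align explicitly before adding")
    return left + right


a = pd.Series([1.0, 2.0])
b = pd.Series([1.0, 2.0, 3.0])

# "Align": pandas pads the shorter operand, so the unmatched row becomes NaN.
aligned = a + b

# "Raise": the strict variant refuses mismatched operands.
try:
    strict_add(a, b)
    raised = False
except ValueError:
    raised = True
```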

@amueller

amueller commented Jun 8, 2020

This might be a silly question but I guess we agree/assume that there's alignment for column names, right? There's also a question whether to raise there on misalignment or drop or create nan columns...
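For reference, a short sketch of what pandas does today with mismatched column names, next to one way to get the "drop" behaviour explicitly. The frames are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]})

# pandas today: columns are aligned by name, and columns present in only
# one operand come back as all-NaN columns.
total = left + right

# "Drop" alternative: restrict both operands to the shared columns first.
shared = left.columns.intersection(right.columns)
dropped = left[shared] + right[shared]
```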

@TomAugspurger
Author

TomAugspurger commented Jun 8, 2020 via email

@TomAugspurger
Author

For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match:

In [77]: df1 = vaex.from_dict({"A": [1, 2, 3]})

In [78]: df1.A[:2] + df1.A
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-78-df2975e4c3c9> in <module>
----> 1 df1.A[:2] + df1.A

~/miniconda3/envs/vaex/lib/python3.8/site-packages/vaex/expression.py in f(a, b)
    111                     else:
    112                         if isinstance(b, Expression):
--> 113                             assert b.ds == a.ds
    114                             b = b.expression
    115                         elif isinstance(b, (np.timedelta64)):

AssertionError:

> This might be a silly question but I guess we agree/assume that there's alignment for column names, right

It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Do other dataframe implementors want to weigh in on what's desired / feasible here?

From Dask's perspective, alignment is doable. We partition by divisions on the index. When those divisions aren't available a full shuffle is needed to do the operation.

@datapythonista
Member

Trying to structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example):

>>> import pandas
>>> df = pandas.DataFrame({'country': ['France', 'USA', 'UK'],
...                        'capital': ['Paris', 'DC', 'London']})
>>> df
  country capital
0  France   Paris
1     USA      DC
2      UK  London

Basic case, same size, same index, in the same order (I guess whatever we do, it will work):

>>> df['country'] + ' - ' + df['capital']
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Same size and index, but index in different order. With row labels and automatic alignment, what we have is:

>>> df['country'] + ' - ' + df['capital'].sort_values()
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Without row labels (or without automatic alignment), I guess we would operate by row id, and rely on sorting for the alignment, e.g. df.sort_values('country_id').

When the size of the dataframes is different, with automatic alignment, pandas fills with NA after aligning, and then operates:

>>> df['country'] + ' - ' + df[df.capital.str.len() > 3]['capital']
0    France - Paris
1               NaN
2       UK - London
dtype: object

Without row labels, I guess the best solution would probably be to fail if the size is different, and rely on a join / reindex to force the user to make the alignment explicit, e.g. df1 + df1.join(on='country_id', how='left').
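As a sketch of what "making the alignment explicit" could look like with today's pandas, the filtered column can be reindexed by hand before the operation, so the user chooses how missing rows are filled. The `"?"` placeholder is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})

# Same filter as above: only capitals longer than three characters survive.
long_capitals = df.loc[df["capital"].str.len() > 3, "capital"]

# Explicit alignment: bring the shorter Series back to the full index and
# decide what the missing rows should contain.
explicit = long_capitals.reindex(df.index, fill_value="?")

combined = df["country"] + " - " + explicit
```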

So, correct me if I'm wrong, but I think the decisions that need to be made regarding alignment are:

  • Do we want row labels?
  • Do we want automatic alignment?
  • Do we want to automatically create NA rows if the index values don't match?

@TomAugspurger
Author

TomAugspurger commented Jun 22, 2020 via email

@maartenbreddels

> For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match

> It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Yes, basically vaex does not have row labels, so neither operation makes sense in the current state. There is a branch which lets the dataframe behave like a 2d array (nep13/nep18), meaning implicit row labels that are row numbers. In the case of a binary operator it will ignore the column labels, and only use the column index, similar to a 2d array.
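To illustrate the difference between matching columns by label and by position, here is a sketch using pandas and NumPy only. It does not use vaex or the branch mentioned above; it just contrasts the two semantics on made-up data.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [10, 20], "A": [30, 40]})

# Label semantics (pandas): column "A" is added to column "A".
by_label = left + right

# 2d-array semantics: the first column is added to the first column,
# whatever the columns are called.
by_position = left.to_numpy() + right.to_numpy()
```

Here `by_label["A"]` holds 31 and 42, while the first column of `by_position` holds 11 and 22.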

@MarcoGorelli
Contributor

In case it's of interest, in Narwhals we solved this by following the left-hand rule for index alignment: https://narwhals-dev.github.io/narwhals/pandas_like_concepts/pandas_index/

An unexpected benefit of this was that, for Plotly, moving to Narwhals ended up solving some existing bugs for free
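A rough pandas-only sketch of one reading of the left-hand rule: rows are paired by position and the result keeps the left operand's index. `add_left_hand` is a hypothetical helper written for this illustration. It is an assumption about the rule's meaning, not Narwhals' implementation or API; the linked page is the authoritative description.

```python
import pandas as pd


def add_left_hand(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: add by position, keeping the left operand's index."""
    if len(left) != len(right):
        raise ValueError("operands must have the same length")
    # Overwrite the right operand's labels with the left's, so pandas'
    # label alignment becomes a no-op and rows pair up by position.
    return left + right.set_axis(left.index)


left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "a", "b"])

positional = add_left_hand(left, right)
label_aligned = left + right  # default pandas behaviour, for comparison
```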

7 participants