Implicit alignment in operations #12

TomAugspurger · 2020-06-05T17:27:10Z

In #2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.

In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])

In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])

In [12]: a
Out[12]:
   A
a  1
b  2
c  3

In [13]: b
Out[13]:
   A
b  2
c  3
a  1

In [14]: a + b
Out[14]:
   A
a  2
b  4
c  6

In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
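As a minimal sketch of that implicit step (reusing the `a` and `b` frames from above; the variable names `a_aligned`, `b_aligned`, `explicit` and `implicit` are only illustrative), the alignment can be spelled out by hand and compared with the operator:

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1]}, index=["b", "c", "a"])

# Reindex both frames to a common (outer-joined) index, as a + b does implicitly.
a_aligned, b_aligned = a.align(b, join="outer")

# After alignment the labels match row for row, so the add pairs the right values.
explicit = a_aligned + b_aligned
implicit = a + b
```

Both results should be identical; the only difference is whether the reindexing is visible in user code.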

A few other places this occurs

Indexing a DataFrame / Series with an integer or boolean series
pd.concat
DataFrame constructor
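A short sketch of those three cases, using small made-up Series `s`, `t` and `mask` (none of these names come from the thread). In each case pandas matches rows by label rather than by position:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
t = pd.Series([1, 2, 3], index=["c", "b", "a"])

# Boolean indexing: the mask is matched to `s` by label, not by position.
mask = pd.Series([True, False, False], index=["c", "b", "a"])
selected = s[mask]  # keeps the row labelled "c", not the first row

# pd.concat along columns: rows are matched by label.
side_by_side = pd.concat([s.rename("s"), t.rename("t")], axis=1)

# DataFrame constructor: Series values are aligned on their indexes.
frame = pd.DataFrame({"s": s, "t": t})
```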

Do we want to adopt this behavior for the standard?


rgommers · 2020-06-05T20:24:26Z

In #2 there seems to be some agreement that row-labels are an important component of a dataframe.

Eh, just to make sure, can you summarize that agreement? As far as I can see you suggested that including row labels was inappropriate, and @devin-petersohn was in favor but also noted that columnar dataframes like Vaex and R tidyverse do not support row labels. Hence my impression was that row labels should be optional or a "level 1" feature.

devin-petersohn · 2020-06-06T14:46:48Z

@rgommers I do not believe that the presence of row labels affects this conversation directly because a + b could be reasonably done with the position instead of row labels, and in that case the position is the row's label. Side note: let's figure out this "levels" thing. It is hard to have meaningful conversations without the concrete levels.

@TomAugspurger This brings up an interesting discussion about joining/manipulating the row labels (or order) and how that interacts with the data in a dataframe.

If we dissect the a + b operation, we are effectively doing a join along both axes, and then applying addition (or another binary operation) wherever labels collide on both axes. This is a bit unusual from the database perspective, but it can be done (though it is tedious).
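To make the dissection concrete, here is a rough sketch that performs the two joins by hand. The frames and the names `rows`, `cols`, `a_full` and `b_full` are invented for illustration, and this is not how pandas implements the operation internally:

```python
import pandas as pd

a = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["x", "y"])
b = pd.DataFrame({"B": [10.0, 20.0], "C": [30.0, 40.0]}, index=["y", "z"])

# Outer join of the labels along each axis.
rows = a.index.union(b.index)
cols = a.columns.union(b.columns)

# Reindex both operands onto the joined labels; missing cells become NaN.
a_full = a.reindex(index=rows, columns=cols)
b_full = b.reindex(index=rows, columns=cols)

# Only cells whose (row, column) label exists in both inputs get a value.
explicit = a_full + b_full
```

Here only the cell at row "y", column "B" is present in both inputs, so it is the only non-NaN entry in the result.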

So there is this ability to treat labels as data that can be joined on, or to manipulate the order of the rows with an align-style join. It's a very nice property for visualization.

TomAugspurger · 2020-06-08T13:53:31Z

Re-reading #2, it does seem that I overstated the level of support for row labels.

As Devin notes, there's a positional version of label alignment:

In [9]: a = pd.Series([1.0, 2.0])

In [10]: b = pd.Series([1.0, 2.0, 3.0])

In [11]: a
Out[11]:
0    1.0
1    2.0
dtype: float64

In [12]: b
Out[12]:
0    1.0
1    2.0
2    3.0
dtype: float64

In [13]: a + b
Out[13]:
0    2.0
1    4.0
2    NaN
dtype: float64

So do we expect that operation to raise (different shapes) or align (by position)? My recommendation would be to align.
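For comparison, a sketch of the two options side by side. `strict_add` is a hypothetical helper written for this example, not an existing pandas function:

```python
import pandas as pd


def strict_add(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: add by position, refusing mismatched lengths."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} != {len(right)}")
    return pd.Series(left.to_numpy() + right.to_numpy())


a = pd.Series([1.0, 2.0])
b = pd.Series([1.0, 2.0, 3.0])

# Option "align": what pandas does today, the unmatched row becomes NaN.
aligned = a + b

# Option "raise": the strict variant rejects the operation outright.
try:
    strict_add(a, b)
    raised = False
except ValueError:
    raised = True
```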

amueller · 2020-06-08T15:58:43Z

This might be a silly question, but I guess we agree/assume that there's alignment for column names, right? There's also the question of whether to raise on misalignment, drop the mismatched columns, or create NaN columns...

TomAugspurger · 2020-06-08T16:27:19Z

We'll want to explicitly state the expected behavior for column names. I'd expect it to match the behavior for row labels.
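The three options raised above (create NaN columns, drop, or raise) could look roughly like this. `require_same_columns` is a hypothetical check made up for the example; the first two options use existing pandas behavior:

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
b = pd.DataFrame({"B": [10, 20], "C": [30, 40]})

# Option 1 (pandas default): keep every column, NaN where one side lacks it.
nan_columns = a + b

# Option 2: drop columns that are not shared, via an inner align on columns.
a_in, b_in = a.align(b, join="inner", axis=1)
dropped = a_in + b_in


# Option 3: raise on any mismatch (hypothetical strict check).
def require_same_columns(left: pd.DataFrame, right: pd.DataFrame) -> None:
    if not left.columns.equals(right.columns):
        raise ValueError("column labels differ")


try:
    require_same_columns(a, b)
    raised = False
except ValueError:
    raised = True
```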


TomAugspurger · 2020-06-09T21:12:20Z

For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match:

In [77]: df1 = vaex.from_dict({"A": [1, 2, 3]})

In [78]: df1.A[:2] + df1.A
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-78-df2975e4c3c9> in <module>
----> 1 df1.A[:2] + df1.A

~/miniconda3/envs/vaex/lib/python3.8/site-packages/vaex/expression.py in f(a, b)
    111                     else:
    112                         if isinstance(b, Expression):
--> 113                             assert b.ds == a.ds
    114                             b = b.expression
    115                         elif isinstance(b, (np.timedelta64)):

AssertionError:

This might be a silly question, but I guess we agree/assume that there's alignment for column names, right?

It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Do other dataframe implementors want to weigh in on what's desired / feasible here?

From Dask's perspective, alignment is doable. We partition by divisions on the index. When those divisions aren't available a full shuffle is needed to do the operation.

datapythonista · 2020-06-19T14:55:36Z

To structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example):

>>> import pandas
>>> df = pandas.DataFrame({'country': ['France', 'USA', 'UK'],
...                        'capital': ['Paris', 'DC', 'London']})
>>> df
  country capital
0  France   Paris
1     USA      DC
2      UK  London

Basic case, same size, same index, in the same order (I guess whatever we do, it will work):

>>> df['country'] + ' - ' + df['capital']
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Same size and index, but index in different order. With row labels and automatic alignment, what we have is:

>>> df['country'] + ' - ' + df['capital'].sort_values()
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Without row labels (or without automatic alignment), I guess we would operate by row position, and rely on sorting for the alignment, e.g. df.sort_values('country_id').
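To illustrate the difference, this sketch pairs the same two columns once by position and once by label. It reuses the `df` from above; `zip` stands in for a purely positional operation, which is my own simplification:

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})

# sort_values() reorders the rows: DC, London, Paris.
sorted_capitals = df["capital"].sort_values()

# Pairing by position (zip ignores labels) mismatches countries and capitals.
by_position = [f"{c} - {k}" for c, k in zip(df["country"], sorted_capitals)]

# Pairing by label (pandas alignment) keeps each country with its capital.
by_label = df["country"] + " - " + sorted_capitals
```

With positional semantics the caller has to make sure both sides are sorted the same way before operating.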

When the size of the dataframes is different, with automatic alignment, pandas fills with NA after aligning, and then operates:

>>> df['country'] + ' - ' + df[df.capital.str.len() > 3]['capital']
0    France - Paris
1               NaN
2       UK - London
dtype: object

Without row labels, I guess the best solution would probably be to fail if the sizes differ, and rely on a join / reindex to force the user to make the alignment explicit, e.g. df1 + df1.join(on='country_id', how='left').
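One way the explicit variant could look, as a sketch: it uses `country` as the join key, since the example frame has no `country_id` column, and `long_names`, `merged` and `labels` are names chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})
long_names = df[df["capital"].str.len() > 3]  # drops the USA row

# A left join on a key column makes the alignment explicit and keeps row order.
merged = df[["country"]].merge(long_names, on="country", how="left")

# Both columns now have the same length, so the operation is purely positional.
labels = merged["country"] + " - " + merged["capital"]
```

The missing capital for USA shows up as a missing value created by the join, not by the binary operation.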

So, correct me if I'm wrong, but I think the decisions that need to be made regarding alignment are:

Do we want row labels?
Do we want automatic alignment?
Do we want to automatically create NA rows if the index values don't match?

TomAugspurger · 2020-06-22T14:19:01Z

Thanks for the summary Marc. I think your three bullets perfectly capture the three levels to this issue.

I suppose there might be one more question: Do we leave binary operations between DataFrame objects out of the spec entirely? That sidesteps the issue of row labels & alignment. And if we do allow binary operations between

DataFrame & scalars
DataFrame & arrays (where an array is an unlabeled column of a dataframe; the only requirement is that the shape is compatible)

then perhaps there isn't much of a loss in functionality?
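A small sketch of what would remain available under that restriction, shown with current pandas and NumPy. The array case is demonstrated on a single column, which is my simplification of "DataFrame & arrays":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# DataFrame & scalar: no labels involved on the right-hand side.
scaled = df * 2

# Column & unlabeled array: matched purely by position.
arr = np.array([10, 20, 30])
shifted = df["A"] + arr

# An array with an incompatible shape is rejected instead of aligned.
try:
    df["A"] + np.array([1, 2])
    raised = False
except ValueError:
    raised = True
```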


maartenbreddels · 2020-06-25T16:53:33Z

For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match

It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Yes, basically vaex does not have row labels, so neither operation makes sense in its current state. There is a branch which lets the dataframe behave like a 2d array (nep13/nep18), meaning the implicit row labels are the row numbers. In the case of a binary operator it will ignore the column labels and only use the column index, similar to a 2d array.
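To show what "ignore the column labels" means in practice, here is a sketch written with pandas and NumPy rather than vaex. It does not use the vaex branch's API, and keeping the left operand's column names in the result is my own assumption:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
right = pd.DataFrame({"p": [10, 20], "q": [30, 40]})

# Label-based (pandas): no shared column names, so every cell is NaN.
by_label = left + right

# Array-like: match columns by position and ignore their names.
by_position = pd.DataFrame(left.to_numpy() + right.to_numpy(),
                           columns=left.columns)
```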

MarcoGorelli · 2024-12-04T15:32:06Z

In case it's of interest, in Narwhals we solved this by following the left-hand rule for index alignment: https://narwhals-dev.github.io/narwhals/pandas_like_concepts/pandas_index/

An unexpected benefit of this was that, for Plotly, moving to Narwhals ended up solving some existing bugs for free.
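As a rough illustration of how I read the left-hand rule (the linked page is the authoritative description), the result keeps the left operand's index and the right operand's values are paired by position. `left_hand_add` is a hypothetical helper written in plain pandas, not part of the Narwhals API:

```python
import pandas as pd


def left_hand_add(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: keep the left index, pair values by position."""
    if len(left) != len(right):
        raise ValueError("operands must have the same length")
    # Overwrite the right operand's index with the left one before adding.
    right_relabelled = pd.Series(right.to_numpy(), index=left.index)
    return left + right_relabelled


a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([2, 3, 1], index=["b", "c", "a"])
result = left_hand_add(a, b)
```

Compared with the label-aligned a + b at the top of the thread, this gives different values for the same inputs, which is exactly the behavior a standard would need to pin down.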

datapythonista mentioned this issue Jun 23, 2020

Dataframe MVP #14

Closed
