Implicit alignment in operations #12
Comments
Eh, just to make sure, can you summarize that agreement? As far as I can see you suggested that including row labels was inappropriate, and @devin-petersohn was in favor but also noted that columnar dataframes like Vaex and R tidyverse do not support row labels. Hence my impression was that row labels should be optional or a "level 1" feature. |
@rgommers I do not believe that the presence of row labels affects this conversation directly because @TomAugspurger This brings up an interesting discussion about joining/manipulating the row labels (or order) and how that interacts with the data in a dataframe. If we dissect the So there is this ability to treat labels as data that can be joined on, or manipulating the order of the rows with an |
Re-reading #2, it does seem that I overstated the level of support for row labels. As Devin notes, there's a positional version of label alignment:
In [9]: a = pd.Series([1.0, 2.0])
In [10]: b = pd.Series([1.0, 2.0, 3.0])
In [11]: a
Out[11]:
0 1.0
1 2.0
dtype: float64
In [12]: b
Out[12]:
0 1.0
1 2.0
2 3.0
dtype: float64
In [13]: a + b
Out[13]:
0 2.0
1 4.0
2 NaN
dtype: float64
So do we expect that operation to raise (different shapes) or align (by position)? My recommendation would be to align. |
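To make the two options concrete, here is a minimal sketch in plain pandas. The `add_strict` helper is hypothetical (it is not a pandas API and not a proposal from this thread); it only illustrates what "raise on different shapes" could look like next to the aligning behaviour pandas has today.

```python
import pandas as pd

a = pd.Series([1.0, 2.0])
b = pd.Series([1.0, 2.0, 3.0])

# Option 1 (what pandas does): align on the union of the two indexes.
# The default index is positional, so label 2 exists only in `b`
# and the result holds NaN there.
aligned = a + b


def add_strict(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical strict variant: refuse operands of different length."""
    if len(left) != len(right):
        raise ValueError(
            f"length mismatch: {len(left)} != {len(right)}"
        )
    # Operate purely by position and keep the left operand's labels.
    return pd.Series(left.to_numpy() + right.to_numpy(), index=left.index)


# Option 2: the caller has to make the shapes match explicitly.
explicit = add_strict(a, b.iloc[:2])
```

With the strict variant the user decides how to reconcile the lengths (slice, reindex, join) before operating, instead of getting NaN silently.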
This might be a silly question, but I guess we agree/assume that there's alignment for column names, right? There's also a question whether to raise there on misalignment or drop or create nan columns... |
We'll want to explicitly state the expected behavior for column names. I'd
expect it to match the behavior for row labels.
|
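The three options for mismatched column names (create NaN columns, drop, raise) can be sketched in pandas as follows. Only the first is built-in pandas behaviour; the "drop" and "raise" variants, including the `add_same_columns` helper, are illustrative assumptions.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
right = pd.DataFrame({"b": [1, 1], "c": [5, 5]})

# "Create NaN columns": pandas aligns on the union of the column labels,
# so `a` and `c` (present on one side only) become all-NaN columns.
union_result = left + right

# "Drop": operate only on the columns both frames share.
shared = left.columns.intersection(right.columns)
dropped_result = left[shared] + right[shared]


def add_same_columns(x: pd.DataFrame, y: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical "raise" variant: require identical column labels."""
    if not x.columns.equals(y.columns):
        raise ValueError("column labels differ; align explicitly first")
    return x + y
```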
For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match:
In [77]: df1 = vaex.from_dict({"A": [1, 2, 3]})
In [78]: df1.A[:2] + df1.A
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-78-df2975e4c3c9> in <module>
----> 1 df1.A[:2] + df1.A
~/miniconda3/envs/vaex/lib/python3.8/site-packages/vaex/expression.py in f(a, b)
111 else:
112 if isinstance(b, Expression):
--> 113 assert b.ds == a.ds
114 b = b.expression
115 elif isinstance(b, (np.timedelta64)):
AssertionError:
It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?) Do other dataframe implementors want to weigh in on what's desired / feasible here? From Dask's perspective, alignment is doable. We partition by divisions on the index. When those divisions aren't available a full shuffle is needed to do the operation. |
Trying to structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example):
>>> import pandas
>>> df = pandas.DataFrame({'country': ['France', 'USA', 'UK'],
...                        'capital': ['Paris', 'DC', 'London']})
>>> df
  country capital
0  France   Paris
1     USA      DC
2      UK  London
Basic case, same size, same index, in the same order (I guess whatever we do, it will work):
>>> df['country'] + ' - ' + df['capital']
0    France - Paris
1          USA - DC
2       UK - London
dtype: object
Same size and index, but index in different order. With row labels and automatic alignment, what we have is:
>>> df['country'] + ' - ' + df['capital'].sort_values()
0    France - Paris
1          USA - DC
2       UK - London
dtype: object
Without row labels (or without automatic alignment), I guess we would operate by row id, and rely on sorting for the alignment: df.sort_values('country_id').
When the size of the dataframes is different, with automatic alignment, pandas fills with NA after aligning, and then operates:
>>> df['country'] + ' - ' + df[df.capital.str.len() > 3]['capital']
0    France - Paris
1               NaN
2       UK - London
dtype: object
Without row labels, I guess the best solution would probably be to fail if the size is different, and rely on a join / reindex to force the user to make the alignment explicitly: df1 + df1.join(on='country_id', how='left').
So, correct me if I'm wrong, but I think the decisions that need to be made regarding alignment are:
- Do we want row labels?
- Do we want automatic alignment?
- Do we want to automatically create NA rows if the index values don't match?
|
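As a rough illustration of the "make the alignment explicit" option from the comment above, the sketch below reuses the same country/capital frame and contrasts implicit alignment with an explicit reindex and an explicit inner join. The variable names are made up for the example, and this is only one possible way to spell explicit alignment in pandas.

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})

# Only capitals longer than three characters: rows labelled 0 and 2 remain.
long_capitals = df.loc[df["capital"].str.len() > 3, "capital"]

# Implicit: pandas aligns on row labels and fills the missing row with NaN.
implicit = df["country"] + " - " + long_capitals

# Explicit reindex: same result, but the user asked for the NA row.
explicit = df["country"] + " - " + long_capitals.reindex(df.index)

# Explicit inner join: keep only rows present on both sides, so no NA rows.
joined = df[["country"]].join(long_capitals, how="inner")
inner = joined["country"] + " - " + joined["capital"]
```

The reindex and the join each answer the third question (create NA rows or not) at the call site rather than leaving it to the operator.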
Thanks for the summary Marc. I think your three bullets perfectly capture
the three levels to this issue.
I suppose there might be one more question: Do we leave binary operations
between DataFrame objects out of the spec entirely?
That sidesteps the issue of row labels & alignment. And if we do allow
binary operations between
- DataFrame & scalars
- DataFrame & arrays (where an array is an unlabeled column of a dataframe.
Only requirement is that the shape is compatible.)
then perhaps there isn't much of a loss in functionality?
|
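For reference, the two kinds of operand suggested above (scalars, and unlabeled arrays of a compatible shape) already work in pandas without any label alignment. The sketch below only demonstrates existing pandas and NumPy behaviour; the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# DataFrame & scalar: the scalar is applied to every element.
scaled = df * 2

# Column & unlabeled 1-D array: matched purely by position.
shifted = df["x"] + np.array([1.0, 1.0, 1.0])

# DataFrame & unlabeled 2-D array of the same shape.
full = df + np.ones((3, 2))

# An incompatible shape is an error, not a partially-NaN result.
shape_error = None
try:
    df["x"] + np.array([1.0, 2.0])
except ValueError as exc:
    shape_error = exc
```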
Yes, basically vaex does not have row labels, so both operations do not make sense in the current state. There is a branch which lets the dataframe behave like a 2d array (nep13/nep18), meaning implicit row labels that are row numbers. In the case of a binary operator it will ignore the column labels, and only use the column index, similar to a 2d array. |
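The "behave like a 2d array" idea can be imitated in pandas by dropping to NumPy, as sketched below. This is not vaex code and says nothing about how that branch is implemented; reusing the left frame's column names for the result is an assumption made for the example.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"c": [10, 20], "d": [30, 40]})

# Label-based (pandas default): no column name is shared,
# so every cell in the result is NaN.
by_label = left + right

# Position-based: ignore the column labels and pair column 0 with column 0,
# column 1 with column 1, as a 2-D array would.
by_position = pd.DataFrame(
    left.to_numpy() + right.to_numpy(),
    columns=left.columns,  # assumption: the result keeps the left names
)
```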
In case it's of interest, in Narwhals we solved this by following the left-hand rule for index alignment: https://narwhals-dev.github.io/narwhals/pandas_like_concepts/pandas_index/ An unexpected benefit of this was that, for Plotly, moving to Narwhals ended up solving some existing bugs for free |
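A rough pandas-only sketch of what a left-hand rule could look like is below. It assumes the rule means "the result keeps the left operand's index and the right operand's values are paired by position"; the `add_left_hand` helper is hypothetical and is not the Narwhals API, so check the linked page for the actual guarantees.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([10, 20, 30], index=[2, 1, 0])

# pandas default: values are matched by label, so 1 is paired with 30.
by_label = left + right


def add_left_hand(lhs: pd.Series, rhs: pd.Series) -> pd.Series:
    """Hypothetical left-hand rule: keep lhs's index, pair values by position."""
    if len(lhs) != len(rhs):
        raise ValueError("operands must have the same length")
    # Re-label the right operand with the left index before operating.
    relabelled = pd.Series(rhs.to_numpy(), index=lhs.index)
    return lhs + relabelled


left_hand = add_left_hand(left, right)
```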
In #2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.
In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
A few other places this occurs:
- pd.concat
Do we want to adopt this behavior for the standard?
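For readers less familiar with this behaviour, here is a small sketch of the implicit alignment using only documented pandas calls (`Series.align` with its default outer join, and `pd.concat` along columns). The labels and values are made up.

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Explicit version of what `a + b` does first: reindex both operands
# to the union of their indexes (outer join is the default).
a_aligned, b_aligned = a.align(b)

# The binary operation then runs on the aligned operands;
# labels present on only one side produce NaN.
total = a + b

# pd.concat along columns also aligns rows on the union of the indexes.
combined = pd.concat([a.rename("a"), b.rename("b")], axis=1)
```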