Skip to content

Append Datafarmes #6806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pi-crust opened this issue Aug 24, 2023 · 3 comments · Fixed by #6808
Closed

Append Datafarmes #6806

pi-crust opened this issue Aug 24, 2023 · 3 comments · Fixed by #6808

Comments

@pi-crust
Copy link

pi-crust commented Aug 24, 2023

System Information (please complete the following information):

  • .NET Version: [e.g. .NET 6.0.301]

Describe the bug

Hi, Appending Dataframes (DataFrame.Append Method) doesn't seem to append based on column names, but appends based on the position. Is this normal?

To Reproduce

`var data = new DataFrame();

var col1 = new StringDataFrameColumn("ColumnA", new string[] { "a", "b", "c", "d", "e" });
var col2 = new Int32DataFrameColumn("ColumnB", new int[] { 1, 2, 3, 4, 5 });
var col3 = new Int32DataFrameColumn("ColumnC", new int[] { 10, 20, 30, 40, 50 });
var col4 = new StringDataFrameColumn("ColumnA", new string[] { "f", "g", "c", "d", "e" });
var col5 = new Int32DataFrameColumn("ColumnB", new int[] { 6, 7, 3, 4, 5 });
var col6 = new Int32DataFrameColumn("ColumnC", new int[] { 100, 200, 300, 400, 500 });

var dataFrame1 = new DataFrame(col1, col2, col3);
var dataFrame2 = new DataFrame(col4, col6, col5);

var dataFrames = new List { dataFrame1, dataFrame2 };
var resultDataFrame = dataFrame1.Append(dataFrame2.Rows);`

Exhibited behavior

index ColumnA ColumnB ColumnC
0 a 1 10
1 b 2 20
2 c 3 30
3 d 4 40
4 e 5 50
5 f 100 6
6 g 200 7
7 c 300 3
8 d 400 4
9 e 500 5

Expected behavior

index ColumnA ColumnB ColumnC
0 a 1 10
1 b 2 20
2 c 3 30
3 d 4 40
4 e 5 50
5 f 6 100
6 g 7 200
7 c 3 300
8 d 4 400
9 e 5 500
@ghost ghost added the untriaged New issue has not been triaged label Aug 24, 2023
@asmirnov82
Copy link
Contributor

asmirnov82 commented Aug 24, 2023

Hi @pi-curst , I would say that this is expected behavior. Append method gets IEnumerable rows as an argument. Internaly for each row in the provided sequence this method calls Append(IEnumerable row). So for each DataFrameRow object IEnumerable is called and it works based on column position in DataFrame.Columns collection (column index), column names are just ignored:

public class DataFrameRow : IEnumerable<object>
{
    ...
    /// <summary>
    /// Returns an enumerator of the values in this row.
    /// </summary>
    public IEnumerator<object> GetEnumerator()
    {
        foreach (DataFrameColumn column in _dataFrame.Columns)
        {
            yield return column[_rowIndex];
        }
    }
    ...
}

So actualy your code looks similar to this:

...
dataFrame1.Append(new object[] { "f", 100, 6 }, true);
dataFrame1.Append(new object[] { "g", 200, 7 }, true);
dataFrame1.Append(new object[] { "c", 300, 3 }, true);
...
var resultDataFrame = dataFrame1.Append(new object[] { "e", 500, 5 }, true);

@asmirnov82
Copy link
Contributor

asmirnov82 commented Aug 24, 2023

However, I checked the same behaviour in pandas dataframe and it works as you expected:

import pandas as pd

d1 = {'ColumnA': ['a', 'b', 'c'], 'ColumnB': [1, 2, 3], 'ColumnC': [10, 20, 30]}
df1 = pd.DataFrame(data=d1)

d2 = {'ColumnA': ['d', 'e', 'f'], 'ColumnC': [40, 50, 60], 'ColumnB': [4, 5, 6]}
df2 = pd.DataFrame(data=d2)

df3=df1.append(df2)

print(df3.to_string())

Result:

      ColumnA    ColumnB  ColumnC
0        a         1          10
1        b         2          20
2        c         3          30
0        d         4          40
1        e         5          50
2        f         6          60

@JakeRadMSFT, @luisquintanilla what do you think? Should it be changed inline with Pandas behavior?
I am ready to work on a fix if required

@luisquintanilla
Copy link
Contributor

luisquintanilla commented Aug 31, 2023

Hi @asmirnov82,

Thanks for the replies and investigations. I can see why it works the way it does now, but I too would expect that if the column names are the same, appending two DataFrames would require that their column names match up and by extension would append the values based on the column names.

@JakeRadMSFT thoughts?

@ghost ghost added the in-pr label Sep 1, 2023
@ghost ghost removed in-pr untriaged New issue has not been triaged labels Sep 2, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants