
DataFrame.LoadCsv can not load CSV with duplicate column names #6182


Closed
torronen opened this issue May 3, 2022 · 4 comments · Fixed by #6772
Labels
enhancement New feature or request Microsoft.Data.Analysis All DataFrame related issues and PRs P2 Priority of the issue for triage purpose: Needs to be fixed at some point.
Milestone

Comments

@torronen
Contributor

torronen commented May 3, 2022

Code:
    IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, guessRows: 100);

This throws an exception:
    DataFrame already contains a column called Target20 (Parameter 'column')

Suggestion:
It would be nice if LoadCsv had an option to ignore or auto-rename duplicate columns.
For small CSV files this is not a big problem, but for huge CSV files renaming the headers by hand is a hassle.
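
Before deciding whether to rename or drop columns, it can help to see which header names actually collide. The following is a minimal sketch, not part of Microsoft.Data.Analysis: `HeaderCheck` and `FindDuplicates` are hypothetical names, and the plain `Split` assumes no header field contains the separator inside quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderCheck
{
    // Hypothetical helper: returns every header name that occurs more than once,
    // mapped to the number of times it occurs.
    public static Dictionary<string, int> FindDuplicates(string headerLine, char separator)
    {
        return headerLine.Split(separator)
            .GroupBy(name => name)
            .Where(group => group.Count() > 1)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

For a header line such as `Id;Target20;Price;Target20`, the result contains a single entry for `Target20` with a count of 2.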

@torronen torronen added the enhancement New feature or request label May 3, 2022
@ghost ghost added the untriaged New issue has not been triaged label May 3, 2022
@torronen
Contributor Author

torronen commented May 3, 2022

If anyone has the same problem: the renamed header names can be passed in the columnNames parameter. This solves my issue. I do not know whether LoadCsv should have this functionality built in (or whether this issue should be closed).

LoadCsv with renamed columns:

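    // Requires: using System.IO; using System.Linq; using Microsoft.Data.Analysis;
    // Read only the header line. A plain Split does not handle quoted fields
    // that contain the separator.
    string line1 = File.ReadLines(TrainDatasetPath).First();
    string[] arr = line1.Split(';');

    // Count the occurrences of each header name and keep only real duplicates.
    var duplicatedItems = arr.GroupBy(a => a)
                             .Where(g => g.Count() > 1)
                             .ToDictionary(g => g.Key, g => g.Count());

    // Walk backwards so the last occurrence gets the highest suffix and the
    // first occurrence keeps its original name (e.g. Target20, Target20_2).
    for (int i = arr.Length - 1; i >= 0; i--)
    {
        string item = arr[i];
        if (!duplicatedItems.ContainsKey(item))
        {
            continue; // unique name, leave unchanged
        }

        if (duplicatedItems[item] > 1)
        {
            arr[i] = String.Format("{0}_{1}", item, duplicatedItems[item]);
        }

        duplicatedItems[item]--;
    }

    // header: true still skips the header row; columnNames supplies the renamed headers.
    IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, columnNames: arr, guessRows: 100);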
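
One case the workaround above does not cover is a generated name such as `Target20_2` colliding with a column that already has that name in the file. The sketch below is one possible variation, written as a reusable helper. `ColumnNames` and `MakeUnique` are hypothetical names and not part of Microsoft.Data.Analysis, and the `_N` suffix scheme is simply carried over from the snippet above.

```csharp
using System;
using System.Collections.Generic;

public static class ColumnNames
{
    // Hypothetical helper: returns a copy of `names` in which every repeated
    // name gets a numeric suffix (_2, _3, ...). The first occurrence keeps its
    // original name, and suffixes that would collide with another name are skipped.
    public static string[] MakeUnique(IReadOnlyList<string> names)
    {
        var used = new HashSet<string>(names);          // original names plus generated ones
        var seen = new HashSet<string>();               // names already emitted unchanged
        var lastSuffix = new Dictionary<string, int>(); // last suffix handed out per name
        var result = new string[names.Count];

        for (int i = 0; i < names.Count; i++)
        {
            string name = names[i];
            if (seen.Add(name))
            {
                result[i] = name; // first occurrence stays as-is
                continue;
            }

            int n = lastSuffix.TryGetValue(name, out int last) ? last + 1 : 2;
            string candidate = $"{name}_{n}";
            while (used.Contains(candidate))
            {
                n++;
                candidate = $"{name}_{n}";
            }

            used.Add(candidate);
            lastSuffix[name] = n;
            result[i] = candidate;
        }

        return result;
    }
}
```

The returned array can be passed as the `columnNames` argument of `DataFrame.LoadCsv`, in the same way as `arr` in the snippet above.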

@michaelgsharp
Contributor

@luisquintanilla @torronen is this something we think should be built in to load? I can see it going both ways honestly. I would probably lean towards having it built in somehow.

@luisquintanilla
Contributor

@michaelgsharp by builtin do you mean duplicate columns are automatically renamed like the snippet @torronen shared?

@beccamc
Contributor

beccamc commented May 9, 2022

+1 - I also experienced pain with large data files in which multiple columns share a name. I had fewer than 50 columns and it was still troublesome to deal with.

@michaelgsharp michaelgsharp added P2 Priority of the issue for triage purpose: Needs to be fixed at some point. Microsoft.Data.Analysis All DataFrame related issues and PRs labels Jun 13, 2022
@michaelgsharp michaelgsharp added this to the ML.NET Future milestone Jun 13, 2022
@ghost ghost removed the untriaged New issue has not been triaged label Jun 13, 2022
@ghost ghost added the in-pr label Jul 24, 2023
@ghost ghost removed the in-pr label Aug 31, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Sep 30, 2023

4 participants