feat(import,csv,psv): Add support for importing CSV and PSV files without header rows #9204

codeaucafe
Contributor

@codeaucafe codeaucafe commented May 12, 2025

Summary

  • Add --no-header flag to treat the first row in CSV/PSV files as data instead of column names
  • Add --columns option to specify column names when importing files without headers
  • Fix nil pointer panic when importing from stdin with --create-table

In short, this feature makes Dolt more compatible with MySQL/SQLite workflows and provides users with more flexibility when importing data.

Problem

Previously, Dolt always expected the first row of CSV/PSV files to contain column names. This differs from MySQL and SQLite which support importing files where the first row contains data. Users migrating from these systems or working with headerless data files couldn't import them without modifying their files.
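To make the difference concrete, here is a minimal sketch of the two ways a reader can treat the first row. It is not Dolt's implementation: `readRows` and its parameters are hypothetical names, and it uses only Go's standard `encoding/csv` package. The `delim` parameter stands in for the CSV/PSV distinction (`,` versus `|`).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readRows is a hypothetical helper for illustration only.
// If noHeader is true, every record is data and columns supplies the names.
// If noHeader is false, the first record is the header row; a non-empty
// columns slice replaces the names found in that header.
func readRows(input string, delim rune, noHeader bool, columns []string) ([]string, [][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delim // ',' for CSV, '|' for PSV

	records, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}

	if noHeader {
		if len(columns) == 0 {
			return nil, nil, fmt.Errorf("column names are required when the file has no header row")
		}
		return columns, records, nil
	}

	if len(records) == 0 {
		return nil, nil, fmt.Errorf("input is empty: expected a header row")
	}
	names := records[0]
	if len(columns) > 0 {
		names = columns
	}
	return names, records[1:], nil
}

func main() {
	// A headerless PSV input: both lines are data.
	names, rows, err := readRows("1|alice\n2|bob\n", '|', true, []string{"id", "name"})
	if err != nil {
		panic(err)
	}
	fmt.Println(names, len(rows))
}
```

With `noHeader` set to false and no column names supplied, the same function falls back to the original behavior of taking names from the first row.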

Additionally, when users attempted to import data from stdin using --create-table, they would encounter a nil pointer panic instead of receiving a clear error message.

Solution

The implementation adds:

  1. A new --no-header flag that treats the first row as data instead of column headers
  2. A complementary --columns option to specify column names when headers aren't present
  3. Proper validation to ensure correct flag combinations
  4. Comprehensive error handling for stdin imports with clear error messages
  5. Integration tests for both CSV and PSV files
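The flag-combination check in item 3 could look roughly like the sketch below. It follows the rule stated in the commit message (--columns is required when creating a new table with --no-header), but the `importOpts` type, the `validate` function, and the error wording are assumptions made for illustration; the real check lives in validateImportArgs and works on parsed arguments.

```go
package main

import (
	"errors"
	"fmt"
)

// importOpts is a hypothetical stand-in for the parsed command-line arguments.
type importOpts struct {
	NoHeader    bool     // corresponds to --no-header
	Columns     []string // corresponds to --columns
	CreateTable bool     // corresponds to --create-table
}

// validate rejects option combinations that cannot produce a usable schema.
// Assumption: column names are only mandatory when a new table is created
// from a headerless file, because nothing else can supply the names.
func validate(o importOpts) error {
	if o.NoHeader && o.CreateTable && len(o.Columns) == 0 {
		return errors.New("--no-header with --create-table requires --columns to name the columns")
	}
	return nil
}

func main() {
	err := validate(importOpts{NoHeader: true, CreateTable: true})
	fmt.Println("missing --columns:", err)

	err = validate(importOpts{NoHeader: true, CreateTable: true, Columns: []string{"id", "name"}})
	fmt.Println("with --columns:", err)
}
```

Running the check before any file is opened means a bad combination fails fast with a usage-style message rather than partway through an import.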

Testing

  • Added integration tests for both CSV and PSV files that verify:
      • Importing files with --no-header and --columns options
      • Error cases when required options are missing
      • Original behavior is maintained when not using --no-header
      • Behavior of --columns with and without --no-header
      • Edge cases like stdin imports

@codeaucafe codeaucafe force-pushed the codeaucafe/feat/7831/allow-import-csv-without-header-row branch from fda0260 to bd40600 Compare May 12, 2025 02:32
@timsehn
Contributor

timsehn commented May 12, 2025

running workflows

@codeaucafe codeaucafe changed the title Add support for importing CSV files without header rows feat(import,csv,psv): Add support for importing CSV and PSV files without header rows May 12, 2025
Contributor

@nicktobey nicktobey left a comment

Thank you, David! This is a very clean fix.

It seems like the newly added tests aren't currently passing though, and I left a couple of comments pertaining to documentation, tests, and code duplication. After that, this should be good to merge.

@nicktobey
Contributor

Don't worry about the "Benchmark SQL Correctness" check failing: that's a known issue for PRs that come from forks.

If the other checks are green, we'll be good.

@codeaucafe codeaucafe force-pushed the codeaucafe/feat/7831/allow-import-csv-without-header-row branch 3 times, most recently from 4c86051 to 700a067 Compare May 13, 2025 06:40
@codeaucafe
Contributor Author

codeaucafe commented May 13, 2025

Hi @nicktobey, I think I got it into a better position, so it's "ready for review". FYI, I realized that using --no-header and --columns while streaming in the CSV was panicking, so I added error handling where the comment // todo: capture stream data to file so we can use schema inference was. Not sure if that's right or not. I made some comments about it; please review them.

thanks for reviewing.

@codeaucafe codeaucafe marked this pull request as ready for review May 13, 2025 06:42
@codeaucafe codeaucafe force-pushed the codeaucafe/feat/7831/allow-import-csv-without-header-row branch from 700a067 to 597b19a Compare May 13, 2025 07:05
// CreateCSVInfo creates a CSVInfo object based on the provided options.
// This is a helper function that extracts and processes CSV-related options
// from the generic options interface.
func CreateCSVInfo(opts interface{}, defaultDelim string) *csv.CSVFileInfo {
Contributor Author

not sure how y'all feel about this file name. Wasn't sure if I should name it after helper func, or keep it a generic helper file. 🤷‍♂️

@@ -325,6 +330,14 @@ func validateImportArgs(apr *argparser.ArgParseResults) errhand.VerboseError {
return errhand.BuildDError("parameters %s and %s are mutually exclusive", allTextParam, schemaParam).Build()
}

if apr.Contains(noHeaderParam) && !apr.Contains(columnsParam) {
Contributor Author

similar handling in the issues I had testing with std streaming of csv

Contributor Author

@codeaucafe codeaucafe May 14, 2025

@nicktobey FYI, this ^ comment is somewhat confusing. I meant that I was handling configs based on table import options in a similar way, not that this is panic handling for stdin streaming of CSV.

This feature allows users to import CSV files where the first row
contains data rather than column names, matching the behavior found in
MySQL and SQLite:

- Add new flag --no-header to treat first row as data instead of column
headers
- Add new option --columns to specify column names when no header is
present
- Require --columns when creating new tables with --no-header
- Provide helpful error messages with usage guidance
- Update validation to ensure proper combination of options
- Add comprehensive tests for the new functionality

Closes: dolthub#7831
@codeaucafe codeaucafe force-pushed the codeaucafe/feat/7831/allow-import-csv-without-header-row branch 2 times, most recently from 86b031d to ceb25a7 Compare May 13, 2025 07:59
@nicktobey
Contributor

Good catch noticing the panic, and thanks for including the fix for that in your PR.

I left one comment about some confusion on my part, but otherwise I like this. I'm running the CI workflows now.

This commit fixes an issue where importing CSV
data from stdin using --create-table would result in a nil pointer
panic. It adds proper error handling for this case by:

1. Improving error checking in validateImportArgs to verify schema file
   is provided when using stdin with --create-table
2. Adding a nil check for rdSchema in newImportSqlEngineMover before
   attempting to access it
3. Updating the parameter description for --columns to clarify that it
   overrides header names when used without --no-header
4. Adding tests to verify error messages and behavior with stdin imports

The improved error handling provides clear, helpful error messages
instead of panicking, and documents the workaround of creating the table
first and then using -u (update) mode for importing.

Closes: dolthub#7831
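The nil check described in item 2 of the commit message above follows a common Go pattern: test the pointer before dereferencing it and return a descriptive error. The sketch below illustrates that pattern only. The `schema` type and the `inferSchema` and `columnNames` functions are hypothetical and do not mirror the actual code in newImportSqlEngineMover; the error text is likewise an assumption based on the workaround the commit message mentions.

```go
package main

import (
	"errors"
	"fmt"
)

// schema is a hypothetical, simplified stand-in for a reader schema.
type schema struct {
	columns []string
}

// inferSchema models the situation from the commit message: when the data
// arrives on stdin there is no file to inspect, so no schema is produced
// and the caller receives a nil pointer.
func inferSchema(fromStdin bool, header []string) *schema {
	if fromStdin {
		return nil
	}
	return &schema{columns: header}
}

// columnNames guards against the nil schema instead of dereferencing it,
// turning what would be a panic into an actionable error.
func columnNames(s *schema) ([]string, error) {
	if s == nil {
		return nil, errors.New("cannot create a table from stdin without a schema: " +
			"provide a schema file, or create the table first and import with -u")
	}
	return s.columns, nil
}

func main() {
	if _, err := columnNames(inferSchema(true, nil)); err != nil {
		fmt.Println("error:", err)
	}

	cols, _ := columnNames(inferSchema(false, []string{"id", "name"}))
	fmt.Println("columns:", cols)
}
```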
@codeaucafe codeaucafe force-pushed the codeaucafe/feat/7831/allow-import-csv-without-header-row branch from ceb25a7 to 4a1c1f6 Compare May 14, 2025 06:45
@nicktobey
Contributor

This looks good!

I forgot to mention: since the newly added bats test files are testing functionality that only works when run against a local database, we need to make sure that they don't get run when testing remote functionality. (That's why the "Test Bats Unix Remote" test is failing.)

We can address that by adding entries to the SKIP_SERVER_TESTS variable defined in integration-tests/bats/helper/local-remote.bash.

After that, I think this PR is good to go!

(As a side note, I noticed that you've been rebasing your branch in response to PR feedback. While rebasing creates a cleaner commit history, it makes the PR process more confusing because it makes it harder to see what changed, and comments left on the old diff may not be visible on the new diff. Once a PR has comments on it, we try to avoid rebasing.)

@codeaucafe
Contributor Author

codeaucafe commented May 14, 2025

Thank you. I'll change that tonight.

Sorry about the rebases. Also, I'm sorry if I missed that in the Contributing markdown.

Thanks again

This update ensures that the new import-no-header-csv.bats and
import-no-header-psv.bats tests are skipped when running remote tests,
as they only work against a local database.
@codeaucafe
Contributor Author

Added, thanks again.

@nicktobey nicktobey merged commit 23eea58 into dolthub:main May 16, 2025
19 of 21 checks passed
@nicktobey
Contributor

I've merged the PR.

Thank you for implementing this. I've had to work around this in the past, so I'm relieved that it's a built-in feature now.

@codeaucafe
Contributor Author

Awesome! Glad to hear that.

Thank you for letting me contribute and reviewing my first Dolt contribution. I really appreciate it.

cheers
