CSV Parsing
Aside: in many places where a character or boolean is specified, it could alternatively be a list, regular expression, function or class.
-
Open the file as a stream, with a specific encoding.
Parameter: encoding Default: UTF-8
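As a minimal sketch of this step in Python: the helper name `open_stream` is hypothetical, and accepting raw bytes as well as a path is an assumption made only so the example can be run without a file on disk. Passing `newline=""` keeps row terminators untranslated, so a later step can still detect them.

```python
import io


def open_stream(source, encoding="utf-8"):
    """Return a text stream for a file path, or for raw bytes (handy in tests)."""
    if isinstance(source, bytes):
        # Wrap in-memory bytes so they decode exactly like a file would.
        return io.TextIOWrapper(io.BytesIO(source), encoding=encoding, newline="")
    # newline="" disables newline translation; \r\n stays \r\n.
    return open(source, mode="r", encoding=encoding, newline="")
```

For a file on disk, `open_stream("data.csv", encoding="latin-1")` would return an ordinary text file object.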
-
Find the rows.
Need to define the escape character first, as it changes the meaning of all other special characters:
Parameter: escape Default: \ Options: character, false
Note: not supported in all parsers
Need to define the enclosure character early, as row terminators and column separators within enclosures do not count as special characters:
Parameter: enclosure Default: "
Note: two consecutive enclosure characters inside a field count as one literal character, if not escaped (Excel legacy)
Note: cells may be enclosed always, sometimes (e.g. only when a value contains special characters or is not an integer), or never.
Parameter: row_terminator Default: \n Examples: \n, \r, \r\n (LF, CR, CRLF), or named conventions such as unix, windows, mac, unicode
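The row-finding rules above can be sketched as a small scanner. This is an illustration under assumptions, not a reference implementation: the function name `split_rows` is hypothetical, the enclosure and escape are assumed to be single characters, and rows are returned as raw text with enclosures and escapes left in place for later steps.

```python
def split_rows(text, row_terminator="\n", enclosure='"', escape="\\"):
    """Split text into raw rows, ignoring terminators inside enclosures."""
    rows, current, enclosed, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if escape and ch == escape and i + 1 < len(text):
            # An escaped character is never special; keep both characters.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == enclosure:
            # A doubled enclosure toggles twice, so the state is unchanged.
            enclosed = not enclosed
        elif not enclosed and text.startswith(row_terminator, i):
            rows.append("".join(current))
            current = []
            i += len(row_terminator)
            continue
        current.append(ch)
        i += 1
    if current:
        rows.append("".join(current))  # last row without a terminator
    return rows
```

Setting `escape=False` disables escape handling, matching the `false` option of the escape parameter.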
-
Skip n rows.
Parameter: skip_rows Default: 0
These rows could be any data at the start of the file, but are often comments (see below), a free-text description of the table, provenance, and/or other metadata.
-
(optional) In the skipped rows, find comment rows.
Parameter: comment_prefix Default: #
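A short sketch of skipping rows and picking out comments among them. The name `skip_leading_rows` and the choice to return the comment text with the prefix stripped are assumptions for illustration.

```python
def skip_leading_rows(rows, skip_rows=0, comment_prefix="#"):
    """Drop the first skip_rows rows; return (remaining rows, comment texts)."""
    skipped, remaining = rows[:skip_rows], rows[skip_rows:]
    comments = [
        row[len(comment_prefix):].strip()
        for row in skipped
        if comment_prefix and row.startswith(comment_prefix)
    ]
    return remaining, comments
```

Skipped rows that do not start with the prefix (free text, provenance) are simply discarded in this sketch.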
-
Recognise n header rows.
Parameter: header_rows Default: 0
-
Split each row into columns.
Parameter: column_separator Default: ,
Note: this character inside an enclosure does not count as a separator.
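Column splitting follows the same pattern as row splitting. The sketch below uses a hypothetical `split_columns` helper, assumes single-character enclosure and escape, and returns raw cells (enclosures still attached) so that a later step can tell enclosed values from bare ones.

```python
def split_columns(row, column_separator=",", enclosure='"', escape="\\"):
    """Split one raw row into raw cells, ignoring separators inside enclosures."""
    cells, current, enclosed, i = [], [], False, 0
    while i < len(row):
        ch = row[i]
        if escape and ch == escape and i + 1 < len(row):
            current.append(row[i:i + 2])  # escaped character is not special
            i += 2
            continue
        if ch == enclosure:
            enclosed = not enclosed
        elif not enclosed and row.startswith(column_separator, i):
            cells.append("".join(current))
            current = []
            i += len(column_separator)
            continue
        current.append(ch)
        i += 1
    cells.append("".join(current))  # the final cell, possibly empty
    return cells
```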
-
Skip n columns of each row.
Parameter: skip_columns Default: 0
-
Recognise n header columns.
Parameter: header_columns Default: 0
-
Read the values of each header cell.
Remove insignificant whitespace:
Parameter: trim Default: true Options: true, false, start, end
Note: whitespace inside an enclosure is always significant
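One way to read a raw cell value, sketched with a hypothetical `read_cell` helper. Two assumptions are made here that the notes above do not settle: whitespace outside an enclosure is always dropped for enclosed cells, and escape sequences are resolved in both enclosed and bare cells.

```python
def read_cell(raw, trim=True, enclosure='"', escape="\\"):
    """Turn a raw cell into its value: trim, strip the enclosure, unescape."""
    core = raw.strip()
    enclosed = (
        len(core) >= 2 and core.startswith(enclosure) and core.endswith(enclosure)
    )
    if enclosed:
        text = core[1:-1]      # whitespace inside the enclosure is kept
    elif trim is True:
        text = core
    elif trim == "start":
        text = raw.lstrip()
    elif trim == "end":
        text = raw.rstrip()
    else:
        text = raw             # trim=False: leave the value untouched
    out, i = [], 0
    while i < len(text):
        if escape and text[i] == escape and i + 1 < len(text):
            out.append(text[i + 1])   # drop the escape, keep the character
            i += 2
        elif enclosed and text.startswith(enclosure * 2, i):
            out.append(enclosure)     # doubled enclosure -> one character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```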
-
Build a key for each column, using values from the header row(s).
Note: skip header columns.
Parameter: fields Default: [] (read from header row or column index)
Parameter: column_prefix Default: null
* If field names are provided, use them as the key.
* If no field names but a header row is present, use the header cell values as keys.
* If there are multiple header rows, use an array of cell values.
* If no field names are provided and there is no header row, use the column index (with optional column prefix).
* If the same key is repeated multiple times, add an incrementing suffix to the key.
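The key-building rules can be sketched as follows. The name `build_keys` is hypothetical, the `_2`, `_3` suffix format is an assumption (the text only says "incrementing suffix"), multi-row headers are represented as tuples so they can serve as dictionary keys, and header columns are assumed to have been removed already.

```python
def _with_suffix(key, n):
    """Append _n to a key; for tuple keys, to the last element."""
    if isinstance(key, tuple):
        return key[:-1] + (f"{key[-1]}_{n}",)
    return f"{key}_{n}"


def build_keys(n_columns, fields=None, header_rows=None, column_prefix=None):
    """Choose one key per column, then make repeated keys unique."""
    if fields:
        keys = list(fields)
    elif header_rows:
        if len(header_rows) == 1:
            keys = list(header_rows[0])
        else:
            # One tuple per column, holding that column's cell in each header row.
            keys = [tuple(column) for column in zip(*header_rows)]
    elif column_prefix:
        keys = [f"{column_prefix}{i}" for i in range(n_columns)]
    else:
        keys = list(range(n_columns))
    used, unique = set(), []
    for key in keys:
        candidate, n = key, 1
        while candidate in used:
            n += 1
            candidate = _with_suffix(key, n)
        used.add(candidate)
        unique.append(candidate)
    return unique
```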
-
Read each non-header row in the table and build a key/value data table.
Skip blank rows, optionally:
Parameter: skip_blank_rows Default: true
Blank rows may be significant, if they are used to demarcate table sections or represent missing measurements.
Parameter: fill_rows Default: true Description: If there are fewer cells in the row than fields, add empty values.
Parameter: trim_rows Default: false Description: Remove empty cells from the end of the row. Note: is this needed?
If there are more cells in the row than fields, throw an exception.
To generate the key for each row, use the value from any header column(s), or the row index.
Combine column and row keys to make the cell key.
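Putting the last step together, here is a sketch that builds a table keyed by (row key, column key). The name `build_table`, the tuple form of the cell key, and the use of `ValueError` for the exception are assumptions. Rows are expected as lists of already-read cell values, and the row index counts every data row, including blank rows that are skipped.

```python
def build_table(rows, keys, header_columns=0, skip_blank_rows=True,
                fill_rows=True, trim_rows=False):
    """Map (row_key, column_key) -> value for each non-header row."""
    table = {}
    for index, cells in enumerate(rows):
        if skip_blank_rows and all(cell == "" for cell in cells):
            continue
        row_header = cells[:header_columns]
        values = list(cells[header_columns:])
        if trim_rows:
            while values and values[-1] == "":
                values.pop()               # drop trailing empty cells
        if len(values) > len(keys):
            raise ValueError(f"row {index} has more cells than fields")
        if fill_rows:
            values += [""] * (len(keys) - len(values))
        if header_columns == 0:
            row_key = index                # no header column: use the row index
        elif header_columns == 1:
            row_key = row_header[0]
        else:
            row_key = tuple(row_header)
        for column_key, value in zip(keys, values):
            table[(row_key, column_key)] = value
    return table
```

With `fill_rows=False`, a short row simply contributes fewer cells, because `zip` stops at the shorter sequence.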