Skip to content
Alf Eaton edited this page Mar 5, 2014 · 7 revisions

Aside: in many places where a character or boolean is specified, it could alternatively be a list, regular expression, function or class.

  1. Open the file as a stream, with a specific encoding.

     Parameter: encoding
     Default: UTF-8
    
  2. Find the rows.

    Need to define the escape character first, as it changes the meaning of all other special characters:

     Parameter: escape
     Default: \
     Options: character, false
     Note: not supported in all parsers
    

    Need to define the enclosure character early, as row terminators and column separators within enclosures do not count as special characters:

     Parameter: enclosure
     Default: "
    

    Note: the enclosure character repeated twice in a field will count as one character, if not escaped (Excel legacy)

    Note: cells may either be enclosed always, sometimes (e.g. if special characters/non-integers occur in values) or never.

     Parameter: row_terminator 
     Default: \n
     Examples: \n, \r, \r\n
     Examples: LF, CR, CRLF
     Examples: unix, windows, mac, unicode
    
  3. Skip n rows.

     Parameter: skip_rows
     Default: 0
    

    These rows could be any data at the start of the file, but are often comments (see below), free text description of the table, provenance and/or other metadata.

  4. (optional) In the skipped rows, find comment rows.

     Parameter: comment_prefix
     Default: #
    
  5. Recognise n header rows.

     Parameter: header_rows
     Default: 0
    
  6. Split each row into columns.

     Parameter: column_separator
     Default: ,
    

    Note: this character inside an enclosure does not count as a separator.

  7. Skip n columns of each row.

     Parameter: skip_columns
     Default: 0
    
  8. Recognise n header columns.

     Parameter: header_columns
     Default: 0
    
  9. Read the values of each header cell.

    Remove insignificant whitespace:

     Parameter: trim
     Default: true
     Options: true, false, start, end
    

    Note: whitespace inside an enclosure is always significant

  10. Build a key for each column, using values from the header row(s).

    Note: skip header columns.

    Parameter: fields
    Default: [] (read from header row or column index)
    
    Parameter: column_prefix
    Default: null
    
    * If field names are provided, use them as the key.
    * If no field names but a header row is present, use the header cell values as keys.
    * If there are multiple header rows, use an array of cell values.
    * If no field names are provided and there is no header row, use the column index (with optional column prefix).
    * If the same key is repeated multiple times, add an incrementing suffix to the key.
    
  11. Read each non-header row in the table and build a key/value data table.

    Skip blank rows, optionally:

    Parameter: skip_blank_rows
    Default: true
    

    Blank rows may be significant, if they are used to demarcate table sections or represent missing measurements.

    Parameter: fill_rows
    Default: true
    Description: If there are less cells in the row than fields, add empty values.
    
    Parameter: trim_rows
    Default: false
    Description: Remove empty cells from the end of the row
    Note: is this needed?
    

    If there are more cells in the row than fields, throw an exception.

    To generate the key for each row, use the value from any header column(s), or the row index.

    Combine column and row keys to make the cell key.

Clone this wiki locally