CSV Parser Notes
https://en.wikipedia.org/wiki/CSV_application_support
File formats: RFC 4180, Creativyst, [Perl CSV specification](http://rath.ca/Misc/Perl_CSV/CSV-2.0.html), Intrastat, SuperCSV
- Each record is one line ...but: A record separator may consist of a line feed (ASCII/LF=0x0A), or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A) ...but: fields may contain embedded line-breaks (see below) so a record may span more than one line.
- Fields are separated with commas.
- Leading and trailing space-characters adjacent to comma field separators are ignored. Space characters can be spaces, or tabs.
- Fields with embedded commas must be delimited with double-quote characters.
- Fields that contain double-quote characters must be surrounded by double-quotes, and each embedded double-quote must be represented by a pair of consecutive double quotes.
- A field that contains embedded line-breaks must be surrounded by double-quotes (see the sketch after this list).
- Fields with leading or trailing spaces must be delimited with double-quote characters.
- Fields may always be delimited with double quotes.
- The first record in a CSV file may be a header record containing column (field) names
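Taken together, the rules above mean a writer must quote any field containing the delimiter, a double-quote, or a line break. A minimal round-trip sketch with Python's csv module, which implements the same quoting rules (though, unlike the rules above, it does not quote fields for surrounding spaces by default):

```python
import csv, io

# One record exercising the rules: embedded comma, embedded quote
# (written as ""), and an embedded line break.
row = ['plain', 'a,b', 'say "hi"', 'two\nlines']

buf = io.StringIO()
csv.writer(buf, lineterminator='\r\n').writerow(row)
print(repr(buf.getvalue()))
# 'plain,"a,b","say ""hi""","two\nlines"\r\n'

buf.seek(0)
print(next(csv.reader(buf)) == row)  # True: the record spans two lines
```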
[CTX - Creativyst® Table Exchange Format](http://www.creativyst.com/Doc/Std/ctx/ctx.htm) “A low-overhead alternative for exchanging tabular data”
- Encoding: file encoding.
- CommentTokens: defines comment tokens. A comment token is a string that, when placed at the beginning of a line, indicates that the line is a comment and should be ignored by the parser (see the sketch after this list).
- Delimiters: defines the delimiters for a text file.
- HasFieldsEnclosedInQuotes: denotes whether fields are enclosed in quotation marks when a delimited file is being parsed.
- TrimWhiteSpace: indicates whether leading and trailing white space should be trimmed from field values.
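Python's csv module has no comment-token or trimming options, but the same behavior can be layered on by filtering the line stream before parsing. A sketch, with illustrative token values (not any particular library's API):

```python
import csv, io

COMMENT_TOKENS = ('#', '//')  # illustrative comment tokens

def parse(stream, trim_whitespace=True):
    # Drop comment lines before they reach the parser. (Assumes comment
    # tokens never begin a line inside a quoted, multi-line field.)
    lines = (ln for ln in stream if not ln.lstrip().startswith(COMMENT_TOKENS))
    for record in csv.reader(lines):
        yield [f.strip() for f in record] if trim_whitespace else record

sample = io.StringIO('# a comment\n a , b \n')
print(list(parse(sample)))  # [['a', 'b']]
```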
fgetcsv
array fgetcsv ( resource $handle [, int $length = 0 [, string $delimiter = ',' [, string $enclosure = '"' [, string $escape = '\\' ]]]] )
length Must be greater than the longest line (in characters) to be found in the CSV file (allowing for trailing line-end characters). It became optional in PHP 5. If this parameter is omitted (or set to 0 in PHP 5.0.4 and later), the maximum line length is not limited, which is slightly slower.
delimiter Set the field delimiter (one character only).
enclosure Set the field enclosure character (one character only).
escape Set the escape character (one character only). Defaults to a backslash.
Returns an indexed array containing the fields read.
Note: A blank line in a CSV file will be returned as an array comprising a single null field, and will not be treated as an error.
Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem.
fgetcsv() returns NULL if an invalid handle is supplied or FALSE on other errors, including end of file.
convert_encoding: boolean
input_encoding: encoding
heading: use first line/entry as field names
fields: override field names
delimiter: comma
enclosure: double quote
auto_non_chars: characters to ignore when attempting to auto-detect the delimiter (see the sketch below)
auto_preferred: preferred delimiter characters, only used when the filtering method returns multiple possible delimiters (happens very rarely)
linefeed: “\r\n” (output line separator)
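A sketch of the auto-detection idea described above (the names mirror the auto_non_chars/auto_preferred options, but this is not the library's implementation): strip characters that cannot be delimiters, count what remains, and break ties with a preference list.

```python
from collections import Counter
import string

AUTO_NON_CHARS = set(string.ascii_letters + string.digits + '"\' \r\n')
AUTO_PREFERRED = [',', ';', '\t', '|']

def guess_delimiter(sample):
    counts = Counter(ch for ch in sample if ch not in AUTO_NON_CHARS)
    if not counts:
        return ','                    # nothing to go on; assume comma
    top = max(counts.values())
    tied = [ch for ch, n in counts.items() if n == top]
    for d in AUTO_PREFERRED:          # tie-break with the preferred list
        if d in tied:
            return d
    return tied[0]

print(guess_delimiter('a;b;c\n1;2;3\n'))  # ';'
```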
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe
The authors do not intend to address any of these related topics:
- data interpretation (is a field containing the string "10" supposed to be a string, a float or an int? is it a number in base 10, base 16 or base 2? is a number in quotes a number or a string?)
- locale-specific data representation (should the number 1.23 be written as "1.23" or "1,23" or "1 23"?) -- this may eventually be addressed.
- fixed-width tabular data -- can already be parsed reliably.
csv.reader(csvfile, dialect='excel', **fmtparams)
dialect: string (e.g. csv.excel, csv.excel_tab)
Dialect.delimiter A one-character string used to separate fields. It defaults to ','.
Dialect.doublequote Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True. On output, if doublequote is False and no escapechar is set, Error is raised if a quotechar is found in a field.
Dialect.escapechar A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. On reading, the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.
Dialect.lineterminator The string used to terminate lines produced by the writer. It defaults to '\r\n'. Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.
Dialect.quotechar A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '"'.
Dialect.quoting Controls when quotes should be generated by the writer and recognised by the reader. It can take on any of the QUOTE_* constants (see section Module Contents) and defaults to QUOTE_MINIMAL.
Dialect.skipinitialspace When True, whitespace immediately following the delimiter is ignored. The default is False.
Dialect.strict When True, raise exception Error on bad CSV input. The default is False.
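The Dialect attributes above compose into named, reusable dialects. A short sketch registering a semicolon-separated dialect and parsing with it:

```python
import csv, io

# A reusable dialect: semicolon delimiter, doubled quotes, and
# whitespace after the delimiter ignored.
csv.register_dialect('semi', delimiter=';', doublequote=True,
                     quotechar='"', skipinitialspace=True)

data = io.StringIO('a; "x;y";c\n')
print(next(csv.reader(data, dialect='semi')))  # ['a', 'x;y', 'c']
```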
csv.QUOTE_ALL Instructs writer objects to quote all fields.
csv.QUOTE_MINIMAL Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
csv.QUOTE_NONNUMERIC Instructs writer objects to quote all non-numeric fields. Instructs the reader to convert all non-quoted fields to type float.
csv.QUOTE_NONE Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered. Instructs reader to perform no special processing of quote characters.
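A quick demonstration of the constants above: with QUOTE_NONNUMERIC the writer quotes only non-numeric fields, and the reader converts every unquoted field to float.

```python
import csv, io

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(['label', 1, 2.5])
print(repr(buf.getvalue()))  # '"label",1,2.5\r\n'

buf.seek(0)
print(next(csv.reader(buf, quoting=csv.QUOTE_NONNUMERIC)))
# ['label', 1.0, 2.5] -- unquoted fields come back as floats
```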
class csv.Sniffer The Sniffer class is used to deduce the format of a CSV file.
The Sniffer class provides two methods:
sniff(sample, delimiters=None) Analyze the given sample and return a Dialect subclass reflecting the parameters found. If the optional delimiters parameter is given, it is interpreted as a string containing possible valid delimiter characters.
has_header(sample) Analyze the sample text (presumed to be in CSV format) and return True if the first row appears to be a series of column headers.
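Usage sketch for the Sniffer methods above: detect the dialect and header from a sample, then parse with the result.

```python
import csv, io

sample = 'name;age\nalice;30\nbob;25\n'
sniffer = csv.Sniffer()

dialect = sniffer.sniff(sample, delimiters=';,')
print(dialect.delimiter)           # ';'
print(sniffer.has_header(sample))  # True (heuristic)

print(list(csv.reader(io.StringIO(sample), dialect)))
```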
We are making the following assumptions:
- The record terminator is only one character in length.
- The field terminator is only one character in length.
- The fields are enclosed by single characters, if any.
- The parser can handle documents where fields are always enclosed, not enclosed at all, or optionally enclosed.
- When fields are strictly all enclosed, there is an assumption that any enclosure characters within the field are escaped by placing a backslash in front of the enclosure character.
The CSV files can be parsed in 3 modes:
- (a) No enclosures
- (b) Fields always enclosed (see the sketch below)
- (c) Fields optionally enclosed
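The backslash-escaped enclosures assumed in mode (b) can be reproduced with Python's csv module by turning off doublequote and setting escapechar (an illustration; this is not the library's own API):

```python
import csv, io

# Every field enclosed; embedded quotes escaped as \" rather than "".
data = io.StringIO('"a","say \\"hi\\"","c"\n')
reader = csv.reader(data, doublequote=False, escapechar='\\')
print(next(reader))  # ['a', 'say "hi"', 'c']
```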
field_terminator: ,
line_terminator: \n
enclosure_char: "
set_skip_lines: 1
possibleDelimiters: , (comma), ; (semicolon), \t (tab), | (pipe), space, \0 (null)
supportedLineEndings:
- \n: Unix/Mac OS X Line Endings (LF)
- \r: Classic Mac Line Endings (CR)
- \r\n: Windows Line Endings (CRLF)
_delimiter, _recognizesBackslashesAsEscapes, _sanitizesFields, _recognizesComments, _stripsLeadingAndTrailingWhitespace
CSVDocument “CsvDocument library is a unit containing a set of classes for CSV file handling. The library was created to exchange data with OpenOffice Calc / MS Office Excel using CSV as intermediate format.”
Support for line breaks embedded into CSV fields. It was one of the reasons to reinvent the wheel. OO Calc supports this feature as well, but MS Excel does not.
IgnoreOuterWhitespace, RemoveTrailingEmptyCells, EqualColCountPerRow, ADelimiter, WithHeader (boolean)
Default settings (RFC 4180-compliant): ADelimiter: comma; QuoteChar: double-quote; line ending: CRLF; IgnoreOuterWhitespace: false; EqualColCountPerRow: true
“pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure”
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine='c', delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False)
Parameters (a combined usage sketch follows this parameter list): filepath_or_buffer : string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
sep : string, default ‘,’ Delimiter to use. If sep is None, will try to automatically determine this. Regular expressions are accepted.
lineterminator : string (length 1), default None Character to break file into lines. Only valid with C parser
quotechar : string (length 1) The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default None Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3). Default (None) results in QUOTE_MINIMAL behavior.
skipinitialspace : boolean, default False Skip spaces after delimiter
escapechar : string
dtype : Type name or dict of column -> type Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
compression : {‘gzip’, ‘bz2’, None}, default None For on-the-fly decompression of on-disk data
dialect : string or csv.Dialect instance, default None If None defaults to Excel dialect. Ignored if sep longer than 1 char See csv.Dialect documentation for more details
header : int row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns, e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped).
skiprows : list-like or integer Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file
index_col : int or sequence or False, default None Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names)
names : array-like List of column names to use. If file contains no header row, then you should explicitly pass header=None
prefix : string or None (default) Prefix to add to column numbers when no header, e.g ‘X’ for X0, X1, ...
na_values : list-like or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
true_values : list Values to consider as True
false_values : list Values to consider as False
keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to
parse_dates : boolean, list of ints or names, list of lists, or dict If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. If 1, 3 -> combine columns 1 and 3 and parse as a single date column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’ A fast-path exists for iso8601-formatted dates.
keep_date_col : boolean, default False If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser : function Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion.
dayfirst : boolean, default False DD/MM format dates, international and European format
thousands : str, default None Thousands separator
comment : str, default None Indicates that the remainder of the line should not be parsed. Does not support commenting out an entire line (such a line will be returned as an empty line)
decimal : str, default ‘.’ Character to recognize as decimal point. E.g. use ‘,’ for European data
nrows : int, default None Number of rows of file to read. Useful for reading pieces of large files
iterator : boolean, default False Return TextFileReader object
chunksize : int, default None Return TextFileReader object for iteration
skipfooter : int, default 0 Number of lines at the bottom of the file to skip
converters : dict, optional Dict of functions for converting values in certain columns. Keys can either be integers or column labels
verbose : boolean, default False Indicate number of NA values placed in non-numeric columns
delimiter : string, default None Alternative argument name for sep. Regular expressions are accepted.
encoding : string, default None Encoding to use for UTF when reading/writing (ex. ‘utf-8’)
squeeze : boolean, default False If the parsed data only contains one column then return a Series
na_filter: boolean, default True Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file
usecols : array-like Return a subset of the columns. Results in much faster parsing time and lower memory usage.
mangle_dupe_cols: boolean, default True Duplicate columns will be specified as ‘X.0’...’X.N’, rather than ‘X’...’X’
tupleize_cols: boolean, default False Leave a list of tuples on columns as is (default is to convert to a Multi Index on the columns)
error_bad_lines: boolean, default True Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser).
warn_bad_lines: boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).
infer_datetime_format : boolean, default False If True and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing
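A combined usage sketch for several of the parameters above (the file and column names are illustrative, not from any real dataset):

```python
import pandas as pd

df = pd.read_csv(
    'orders.csv',                   # illustrative file name
    sep=';',                        # field delimiter
    header=0,                       # first row holds the column names
    na_values=['n/a', 'missing'],   # extra strings recognized as NaN
    parse_dates=['order_date'],     # parse this column as datetimes
    dtype={'quantity': 'int64'},    # force a column's type
    comment='#',                    # ignore the rest of a line after '#'
    nrows=1000,                     # read only the first 1000 data rows
)
print(df.dtypes)
```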
LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name'
[REPLACE | IGNORE]
INTO TABLE tbl_name
[CHARACTER SET charset_name]
[{FIELDS | COLUMNS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
[IGNORE number LINES]
[(col_name_or_user_var,...)]
[SET col_name = expr,...]
CHARACTER SET: encoding
FIELDS TERMINATED BY: separator; ENCLOSED BY: enclosure; ESCAPED BY: escape char
LINES STARTING BY: prefix to skip on each line; TERMINATED BY: line ending
defaults: FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' STARTING BY ''
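A concrete statement under this syntax (the table and columns are hypothetical), as it might be issued from Python through any MySQL DB-API driver:

```python
# Hypothetical table and columns; the statement follows the syntax
# summarized above. Note the doubled backslashes: MySQL itself sees
# ESCAPED BY '\\' and LINES TERMINATED BY '\n'.
LOAD_SQL = """
    LOAD DATA LOCAL INFILE 'people.csv'
    INTO TABLE people
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\\\'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (name, age, city)
"""

def load(conn):
    # conn: a DB-API connection with LOCAL INFILE enabled
    cur = conn.cursor()
    cur.execute(LOAD_SQL)
    conn.commit()
    cur.close()
```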
This specification uses the syntax described in Appendix A of the first edition of O'Reilly's Programming Perl book (i.e., the Perl 4 camel book).
CSV_RECORD ::= (* FIELD DELIM *) FIELD REC_SEP
FIELD ::= QUOTED_TEXT | TEXT
DELIM ::= `,'
REC_SEP ::= `\n'
TEXT ::= LIT_STR | ["] LIT_STR [^"] | [^"] LIT_STR ["]
LIT_STR ::= (* LITERAL_CHAR *)
LITERAL_CHAR ::= NOT_COMMA_NL
NOT_COMMA_NL ::= [^,\n]
QUOTED_TEXT ::= ["] (* NOT_A_QUOTE *) ["]
NOT_A_QUOTE ::= [^"] | ESCAPED_QUOTE
ESCAPED_QUOTE ::= `""'
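The grammar above translates almost directly into a regular expression: a field is either quoted text (with "" as an escaped quote) or a run of characters containing no comma or newline. A minimal Python sketch:

```python
import re

# QUOTED_TEXT | TEXT, per the grammar above.
FIELD = re.compile(r'"(?:[^"]|"")*"|[^,\n]*')

def parse_record(line):
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        text = m.group(0)
        if text.startswith('"'):
            text = text[1:-1].replace('""', '"')  # unquote, unescape
        fields.append(text)
        pos = m.end()
        if pos >= len(line) or line[pos] != ',':
            return fields
        pos += 1  # consume the DELIM

print(parse_record('a,"b,""c""",d'))  # ['a', 'b,"c"', 'd']
```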
f_encoding=utf8
csv_eol=\n
csv_sep_char=;
csv_quote_char="
csv_escape_char=
csv_class=Text::CSV_XS
csv_null=1
encoding
sep_char
headers If headers is given, it should be either an anonymous list of column names or a flag: auto or skip. When skip is used, the header will not be included in the output.
fragment: e.g. “row=3;6-9;15-*” Only output the fragment as defined in the "fragment" method
- delimiter: Set the field delimiter. One character only, defaults to comma.
- rowDelimiter: String used to delimit record rows, or a special value; special values are 'auto', 'unix', 'mac', 'windows', 'unicode'; defaults to 'auto' (discovered in source, or 'unix' if no source is specified).
- quote: Optional character surrounding a field, one character only, defaults to double quote.
- escape: Set the escape character, one character only, defaults to double quote.
- columns: List of fields, or true if autodiscovered in the first CSV line, defaults to null. Impacts the transform argument and the data event by providing an object instead of an array; order matters; see the transform and the columns sections for more details.
- comment: Treat all the characters after this one as a comment, defaults to '#'.
- flags: Used to read a file stream, defaults to the r character.
- encoding: Encoding of the read stream, defaults to 'utf8', applied when a readable stream is created.
- trim: If true, ignore whitespace immediately around the delimiter, defaults to false.
- ltrim: If true, ignore whitespace immediately following the delimiter (i.e. left-trim all fields), defaults to false.
- rtrim: If true, ignore whitespace immediately preceding the delimiter (i.e. right-trim all fields), defaults to false.
[datatool]
\DTLloaddb[noheader,keys={,ColW5, ColW6, ColW7}]{myDB}{MyData.csv}
:col_sep: The String placed between each field.
:row_sep: The String appended to the end of each row. This can be set to the special :auto setting, which requests that FasterCSV automatically discover this from the data. Auto-discovery reads ahead in the data looking for the next "\r\n", "\n", or "\r" sequence. A sequence will be selected even if it occurs in a quoted field, assuming that you would have the same line endings there. If none of those sequences is found, data is ARGF, STDIN, STDOUT, or STDERR, or the stream is only available for output, the default $INPUT_RECORD_SEPARATOR ($/) is used.
:quote_char: The character used to quote fields. This has to be a single character String. This is useful for applications that incorrectly use ' as the quote character instead of the correct ". FasterCSV will always consider a doubled sequence of this character to be an escaped quote.
:encoding: The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: `n' or `N' for none, `e' or `E' for EUC, `s' or `S' for SJIS, and `u' or `U' for UTF-8 (see Regexp.new()).
:field_size_limit: This is a maximum size FasterCSV will read ahead looking for the closing quote for a field. (In truth, it reads to the first line ending beyond this size.) If a quote cannot be found within the limit FasterCSV will raise a MalformedCSVError, assuming the data is faulty. You can use this limit to prevent what are effectively DoS attacks on the parser. However, this limit can cause a legitimate parse to fail and thus is set to nil, or off, by default.
:converters: An Array of names from the Converters Hash and/or lambdas that handle custom conversion. A single converter doesn‘t have to be in an Array.
:unconverted_fields: If set to true, an unconverted_fields() method will be added to all returned rows (Array or FasterCSV::Row) that will return the fields as they were before conversion. Note that :headers supplied by Array or String were not fields of the document and thus will have an empty Array attached.
:headers: If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of FasterCSV::parse_line() with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes FasterCSV.shift() to return rows as FasterCSV::Row objects instead of Arrays and FasterCSV.read() to return FasterCSV::Table objects instead of an Array of Arrays.
:return_headers: When false, header rows are silently swallowed. If set to true, header rows are returned in a FasterCSV::Row object with identical headers and fields (save that the fields do not go through the converters).
:header_converters: Identical in functionality to :converters save that the conversions are only made to header rows.
:skip_blanks: When set to a true value, FasterCSV will skip over any rows with no content.
:unconverted_fields If set to true, an unconverted_fields() method will be added to all returned rows (Array or CSV::Row) that will return the fields as they were before conversion. Note that :headers supplied by Array or String were not fields of the document and thus will have an empty Array attached.
:skip_lines When set to an object responding to match, every line matching it is considered a comment and ignored during parsing. When set to a String, it is first converted to a Regexp. When set to nil no line is considered a comment. If the passed object does not respond to match, ArgumentError is thrown.
M = csvread(filename) reads a comma-separated value formatted file, filename. The file can only contain numeric values.
M = csvread(filename,row,col) reads data from the file starting at the specified row and column. The row and column arguments are zero based, so that row = 0 and col = 0 specify the first value in the file.
M = csvread(filename,row,col,csvRange) reads only the range specified by csvRange.
If csvRange is a 4-element vector, then it must have the form [R1,C1,R2,C2], where (R1,C1) is the upper left corner of the data to be read and (R2,C2) is the lower right corner. The range is zero based, so that R1 = 0 specifies the first row of data, and C1 = 0 specifies the first column of data.
If csvRange is a string, then it should be specified using spreadsheet notation, as in csvRange = 'A1..B7'.
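For comparison, a rough Python analogue of the row/col form using numpy (an assumption; csvread itself is MATLAB). Like csvread, it handles numeric data only and treats row and col as zero-based offsets:

```python
import numpy as np

def csvread(filename, row=0, col=0):
    # Skip the first `row` lines, then drop the first `col` columns.
    m = np.loadtxt(filename, delimiter=',', skiprows=row, ndmin=2)
    return m[:, col:]
```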
http://members.optusnet.com.au/apicard/csv-parser.lisp
SKIP-LINES, if provided, is the number of lines to skip. Other parameters: FIELD-SEPARATOR, QUOTE-CHARACTER.
Import["file.csv"] imports a CSV file, returning an array.
Import["file.csv"] returns a list of lists containing strings and numbers, representing the rows and columns stored in the file.
Import["file.csv", elem] imports the specified element from a CSV file.
- CharacterEncoding: "ASCII" (raw character encoding used in the file)
- TextDelimiters: Automatic (string or list of strings used to delimit non-numeric fields)
- CurrencyTokens: {{"$", "£", "¥", "€"}, {"c", "¢", "p", "F"}} (currency units to be skipped when importing numerical values)
- DateStringFormat: None (date format, given as a DateString specification)
- EmptyField: "" (how to represent empty fields)
- HeaderLines: 0 (number of lines to skip at the beginning of the file)
- IgnoreEmptyLines: False (whether to ignore empty lines)
- Numeric: True (whether to import data fields as numbers if possible)
- NumberSigns: {"-", "+"} (strings to use for signs of negative and positive numbers)
Import converts table entries formatted as specified by the "DateStringFormat" option to a DateList representation of the form {y, m, d, h, m, s}. With "Numeric" -> False, numbers will be imported as strings in the form they appear in the file. Import automatically recognizes all common conventions for the encoding of line-separator characters.
POJO support Read or write using any old Javabean. Perform deep mapping and index-based mapping using the new Dozer extension! For the old-fashioned, you can read or write with Lists and Maps as well.
Automatic CSV encoding Forget about handling special characters such as commas and double-quotes - Super CSV will take care of that for you! All content is properly escaped/un-escaped according to the CSV specification.
Highly configurable Choose your own delimiter, quote character and line separator - or just use one of the predefined configurations. Comma-separated, tab-separated, semicolon-separated (Germany/Denmark) - it's all possible.
Data conversion Powerful cell processors make it simple to parse input (to Booleans, Integers, Dates, etc), transform values (trimming Strings, doing regular expression replacement, etc) and format output like Dates and Numbers.
Constraint validation Verify that your data conforms to one or more constraints, such as number ranges, string lengths or uniqueness.
Stream-based I/O Operates on streams rather than filenames, and gives you the control to flush or close the streams when you want. Write to a file, over the network, to a zip file, whatever!
Super CSV accepts all line breaks (Windows, Mac or Unix) when reading CSV files, and uses the end of line symbols specified by the user (via the CsvPreference object) when writing CSV files. Super CSV will add a line break when writing the last line of a CSV file, but a line break on the last line is optional when reading. Super CSV provides methods for reading and writing headers, if required. It also makes use of the header for mapping between CSV and POJOs (see CsvBeanReader/CsvBeanWriter).
The delimiter in Super CSV is configurable via the CsvPreference object, though it is typically a comma.
Super CSV expects each line to contain the same number of fields (including the header). In cases where the number of fields varies, CsvListReader/CsvListWriter should be used, as they contain methods for reading/writing lines of arbitrary length.
Super CSV escapes double-quotes with a preceding double-quote. Please note that the sometimes-used convention of escaping double-quotes as \" (instead of "") is not supported.
| Constant | Quote character | Delimiter character | End of line symbols |
| --- | --- | --- | --- |
| STANDARD_PREFERENCE | " | , | \r\n |
| EXCEL_PREFERENCE | " | , | \n |
| EXCEL_NORTH_EUROPE_PREFERENCE | " | ; | \n |
| TAB_PREFERENCE | " | \t | \n |
surroundingSpacesNeedQuotes (default false): when true, surrounding spaces without quotes are trimmed when reading, and quotes are automatically added for Strings containing surrounding spaces when writing.
getDelimiterChar
getEndOfLineSymbols
getQuoteChar
isSurroundingSpacesNeedQuotes
getQuoteMode (e.g. AlwaysQuoteMode, ColumnQuoteMode, NormalQuoteMode)
getCommentMatcher (e.g. CommentStartsWith)
When using NormalQuoteMode surrounding quotes are only applied if required to escape special characters (per RFC4180). When using ColumnQuoteMode surrounding quotes are only applied if required to escape special characters (per RFC4180), or if a particular column should always be quoted. When using AlwaysQuoteMode surrounding quotes are always applied.
Convenience functions read.csv and read.delim provide arguments to read.table appropriate for CSV and tab-delimited files exported from spreadsheets in English-speaking locales. The variations read.csv2 and read.delim2 are appropriate for use in those locales where the comma is used for the decimal point and (for read.csv2) for spreadsheets which use semicolons to separate fields.
- Encoding (fileEncoding) If the file contains non-ASCII character fields, ensure that it is read in the correct encoding.
- Header line (header = true, row.names = 1) We recommend that you specify the header argument explicitly. Conventionally the header line has entries only for the columns and not for the row labels, so it is one field shorter than the remaining lines. (If R sees this, it sets header = TRUE.)
- Separator (sep) Normally looking at the file will determine the field separator to be used, but with white-space separated files there may be a choice between the default sep = "" which uses any white space (spaces, tabs or newlines) as a separator, sep = " " and sep = "\t". Note that the choice of separator affects the input of quoted strings.
- Quoting (quote) By default character strings can be quoted by either ‘"’ or ‘'’, and in each case all the characters up to a matching quote are taken as part of the character string. The set of valid quoting characters (which might be none) is controlled by the quote argument. For sep = "\n" the default is changed to quote = "". If no separator character is specified, quotes can be escaped within quoted strings by immediately preceding them by ‘\’, C-style. If a separator character is specified, quotes can be escaped within quoted strings by doubling them as is conventional in spreadsheets.
- Missing values (na.strings) By default the file is assumed to contain the character string NA to represent missing values, but this can be changed by the argument na.strings, which is a vector of one or more character representations of missing values. Empty fields in numeric columns are also regarded as missing values. In numeric columns, the values NaN, Inf and -Inf are accepted.
- Unfilled lines (fill) It is quite common for a file exported from a spreadsheet to have all trailing empty fields (and their separators) omitted. To read such files set fill = TRUE.
- White space in character fields (strip.white) If a separator is specified, leading and trailing white space in character fields is regarded as part of the field. To strip the space, use argument strip.white = TRUE.
- Blank lines (blank.lines.skip) By default, read.table ignores empty lines. This can be changed by setting blank.lines.skip = FALSE, which will only be useful in conjunction with fill = TRUE, perhaps to use blank rows to indicate missing cases in a regular layout.
- Classes for the variables (colClasses, as.is) Unless you take any special action, read.table reads all the columns as character vectors and then tries to select a suitable class for each variable in the data frame. It tries in turn logical, integer, numeric and complex, moving on if any entry is not missing and cannot be converted. If all of these fail, the variable is converted to a factor. Arguments colClasses and as.is provide greater control. Specifying as.is = TRUE suppresses conversion of character vectors to factors (only). Using colClasses allows the desired class to be set for each column in the input: it will be faster and use less memory. Note that colClasses and as.is are specified per column, not per variable, and so include the column of row names (if any).
- Comments (comment.char) By default, read.table uses ‘#’ as a comment character, and if this is encountered (except in quoted strings) the rest of the line is ignored. Lines containing only white space and a comment are treated as blank lines. If it is known that there will be no comments in the data file, it is safer (and may be faster) to use comment.char = "".
- Escapes (allowEscapes) Many OSes have conventions for using backslash as an escape character in text files, but Windows does not (and uses backslash in path names). It is optional in R whether such conventions are applied to data files. Both read.table and scan have a logical argument allowEscapes. This is false by default, and backslashes are then only interpreted as (under circumstances described above) escaping quotes. If this is set to true, C-style escapes are interpreted, namely the control characters \a, \b, \f, \n, \r, \t, \v and octal and hexadecimal representations like \040 and \0x2A. Any other escaped character is treated as itself, including backslash. Note that Unicode escapes such as \uxxxx are never interpreted.
- Encoding (fileEncoding) Note: Some people claim that UTF-8 files should never have a BOM, but some software (apparently including Excel:mac) uses them, and many Unix-alike OSes do not accept them.