Switch to the eminem library for more efficient parsing of Matrix Market files. (#122)
This avoids unnecessary memory allocations for large matrices, especially if we
use a two-pass approach to avoid an intermediate vector of vectors. We support
multi-threaded parsing without relying on BiocParallel, and we allow direct
reading into dgCMatrices or SVT_SparseMatrices to avoid R-level fiddling.
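
The two-pass idea in the message above can be illustrated with a minimal base R sketch. This is not the eminem implementation (which is a C++ library); the helper name `two_pass_csc` and the toy input are hypothetical. The first pass only counts the non-zero entries per column to size the compressed sparse column arrays, and the second pass writes each triplet straight into its final slot, so no intermediate vector of per-column vectors is needed.

```r
# Hypothetical helper: build compressed sparse column (CSC) arrays from
# coordinate triplets in two passes. 'rows' and 'cols' are 1-based indices.
# Entries keep their input order within a column, so this sketch assumes
# that rows appear in increasing order within each column.
two_pass_csc <- function(rows, cols, vals, ncol) {
    # Pass 1: count the non-zero entries in each column.
    counts <- tabulate(cols, nbins = ncol)
    # 0-based column pointers, laid out like the 'p' slot of a dgCMatrix.
    p <- c(0L, cumsum(counts))

    # Preallocate the outputs at their final size.
    i <- integer(length(vals))
    x <- numeric(length(vals))

    # Pass 2: place each triplet directly into its column's block.
    nxt <- p[-length(p)] + 1L  # next free (1-based) slot for each column
    for (k in seq_along(vals)) {
        j <- cols[k]
        i[nxt[j]] <- rows[k] - 1L  # store 0-based row index
        x[nxt[j]] <- vals[k]
        nxt[j] <- nxt[j] + 1L
    }
    list(i = i, p = p, x = x)
}

# Toy 3 x 3 matrix with triplets listed in arbitrary column order.
out <- two_pass_csc(rows = c(1L, 1L, 2L, 3L), cols = c(2L, 1L, 3L, 2L),
                    vals = c(4, 1, 7, 5), ncol = 3L)
```

The trade-off matches the one documented for `mtx.two.pass`: the input is traversed twice, but the output arrays are allocated exactly once at their final size.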
R/read10xCounts.R (50 additions & 26 deletions)
@@ -13,15 +13,22 @@
13
13
#' @param sample.names A character vector of length equal to \code{samples}, containing the sample names to store in the column metadata of the output object.
14
14
#' If \code{NULL}, the file paths in \code{samples} are used directly.
15
15
#' @param col.names A logical scalar indicating whether the columns of the output object should be named with the cell barcodes.
16
-
#' @param row.names String specifying whether to use Ensembl IDs ("ID") or gene symbols ("Symbol") as row names. If using symbols, the Ensembl ID will be appended to disambiguate in case the same symbol corresponds to multiple Ensembl IDs.
16
+
#' @param row.names String specifying whether to use Ensembl IDs ("ID") or gene symbols ("Symbol") as row names.
17
+
#' For symbols, the Ensembl ID will be appended to disambiguate rows where the same symbol corresponds to multiple Ensembl IDs.
17
18
#' @param type String specifying the type of 10X format to read data from.
18
19
#' @param version String specifying the version of the 10X format to read data from.
19
20
#' @param delayed Logical scalar indicating whether sparse matrices should be wrapped in \linkS4class{DelayedArray}s before combining.
20
21
#' Only relevant for multiple \code{samples}.
21
22
#' @param genome String specifying the genome if \code{type="HDF5"} and \code{version='2'}.
22
-
#' @param compressed Logical scalar indicating whether the text files are compressed for \code{type="sparse"} or \code{"prefix"}.
23
+
#' @param compressed Logical scalar indicating whether the text files are compressed for \code{type="mtx"} or \code{"prefix"}.
23
24
#' @param intersect.genes Logical scalar indicating whether to take the intersection of common genes across all samples.
24
25
#' If \code{FALSE}, differences in gene information across samples will cause an error to be raised.
26
+
#' @param mtx.two.pass Logical scalar indicating whether to use a two-pass approach for loading data from a Matrix Market file.
27
+
#' This reduces peak memory usage at the cost of some additional runtime.
28
+
#' Only relevant when \code{type="mtx"} or \code{type="prefix"}.
29
+
#' @param mtx.class String specifying the class of the output matrix when \code{type="mtx"} or \code{type="prefix"}.
30
+
#' @param mtx.threads Integer scalar specifying the number of threads to use for reading Matrix Market files.
31
+
#' Only relevant when \code{type="mtx"} or \code{type="prefix"}.
25
32
#' @param BPPARAM A \linkS4class{BiocParallelParam} object specifying how loading should be parallelized for multiple \code{samples}.
26
33
#'
27
34
#' @return A \linkS4class{SingleCellExperiment} object containing count data for each gene (row) and cell (column) across all \code{samples}.
@@ -47,34 +54,31 @@
47
54
#' If \code{type="auto"}, the format of the input file is automatically detected for each \code{samples} based on whether it ends with \code{".h5"}.
48
55
#' If so, \code{type} is set to \code{"HDF5"}; otherwise it is set to \code{"sparse"}.
49
56
#' \itemize{
50
-
#' \item If \code{type="sparse"}, count data are loaded as a \linkS4class{dgCMatrix} object.
51
-
#' This is a conventional column-sparse compressed matrix format produced by the CellRanger pipeline,
52
-
#' consisting of a (possibly Gzipped) MatrixMarket text file (\code{"matrix.mtx"})
57
+
#' \item If \code{type="mtx"} (or its older alias \code{"sparse"}), count data are assumed to be stored in a directory.
58
+
#' This should contain a (possibly Gzipped) MatrixMarket text file (\code{"matrix.mtx"})
53
59
#' with additional tab-delimited files for barcodes (\code{"barcodes.tsv"})
54
60
#' and gene annotation (\code{"features.tsv"} for version 3 or \code{"genes.tsv"} for version 2).
55
-
#' \item If \code{type="prefix"}, count data are also loaded as a \linkS4class{dgCMatrix} object.
56
-
#' This assumes the same three-file structure for each sample as described for \code{type="sparse"},
57
-
#' but each sample is defined here by a prefix in the file names rather than by being a separate directory.
58
-
#' For example, if the \code{samples} entry is \code{"xyx_"},
59
-
#' the files are expected to be \code{"xyz_matrix.mtx"}, \code{"xyz_barcodes.tsv"}, etc.
60
-
#' \item If \code{type="HDF5"}, count data are assumed to follow the 10X sparse HDF5 format for large data sets.
61
+
#' \item If \code{type="prefix"}, count data are assumed to follow the same three-file structure for each sample as described for \code{type="mtx"}.
62
+
#' However, each sample is defined by a prefix in the file names rather than by being stored in a separate directory.
63
+
#' For example, if the \code{samples} entry is \code{"xyz_"}, the files are expected to be \code{"xyz_matrix.mtx"}, \code{"xyz_barcodes.tsv"}, etc.
64
+
#' \item If \code{type="hdf5"} (or its older alias \code{"HDF5"}), count data are assumed to follow the 10X sparse HDF5 format for large data sets.
61
65
#' It is loaded as a \linkS4class{TENxMatrix} object, which is a stub object that refers back to the file in \code{samples}.
62
66
#' Users may need to set \code{genome} if it cannot be automatically determined when \code{version="2"}.
63
67
#' }
64
68
#'
65
-
#' When \code{type="sparse"} or \code{"prefix"} and \code{compressed=NULL},
69
+
#' When \code{type="mtx"} or \code{"prefix"} and \code{compressed=NULL},
66
70
#' the function will automatically search for both the unzipped and Gzipped versions of the files.
67
71
#' This assumes that the compressed files have an additional \code{".gz"} suffix.
68
72
#' We can restrict to only compressed or uncompressed files by setting \code{compressed=TRUE} or \code{FALSE}, respectively.
69
73
#'
70
74
#' CellRanger 3.0 introduced a major change in the format of the output files for both \code{type}s.
71
75
#' If \code{version="auto"}, the version of the format is automatically detected from the supplied paths.
72
-
#' For \code{type="sparse"}, this is based on whether there is a \code{"features.tsv.gz"} file in the directory.
76
+
#' For \code{type="mtx"}, this is based on whether there is a \code{"features.tsv.gz"} file in the directory.
73
77
#' For \code{type="HDF5"}, this is based on whether there is a top-level \code{"matrix"} group with a \code{"matrix/features"} subgroup in the file.
74
78
#'
75
79
#' Matrices are combined by column if multiple \code{samples} were specified.
76
80
#' This will throw an error if the gene information is not consistent across \code{samples}.
77
-
#' For \code{type="sparse"} or \code{"prefix"}, users can set \code{delayed=TRUE} to save memory during the combining process.
81
+
#' For \code{type="mtx"} or \code{"prefix"}, users can set \code{delayed=TRUE} to save memory during the combining process.
78
82
#' This also avoids integer overflow for very large datasets.
79
83
#'
80
84
#' If \code{col.names=TRUE} and \code{length(samples)==1}, each column is named by the cell barcode.
#' Strings containing metadata attributes to be added to the HDF5 file for \code{type="HDF5"}.
21
+
#' Strings containing metadata attributes to be added to the HDF5 file for \code{type="hdf5"}.
22
22
#' Their interpretation is not formally documented and is left to the user's imagination.
23
23
#'
24
24
#' @details
25
25
#' This function will try to automatically detect the desired format based on whether \code{path} ends with \code{".h5"}.
26
-
#' If so, it assumes that \code{path} specifies a HDF5 file path and sets \code{type="HDF5"}.
27
-
#' Otherwise it will set \code{type="sparse"} under the assumption that \code{path} specifies a path to a directory.
26
+
#' If so, it assumes that \code{path} specifies a HDF5 file path and sets \code{type="hdf5"}.
27
+
#' Otherwise it will set \code{type="mtx"} under the assumption that \code{path} specifies a path to a directory.
28
28
#'
29
29
#' Note that there were major changes in the output format for CellRanger version 3.0 to account for non-gene features such as antibody or CRISPR tags.
30
30
#' Users can switch to this new format using \code{version="3"}.
@@ -35,11 +35,11 @@
35
35
#' We recommend against doing so routinely due to CellRanger's dependence on undocumented metadata attributes that may change without notice.
36
36
#'
37
37
#' @return
38
-
#' For \code{type="sparse"}, a directory is produced at \code{path}.
38
+
#' For \code{type="mtx"}, a directory is produced at \code{path}.
39
39
#' If \code{version="2"}, this will contain the files \code{"matrix.mtx"}, \code{"barcodes.tsv"} and \code{"genes.tsv"}.
40
40
#' If \code{version="3"}, it will instead contain \code{"matrix.mtx.gz"}, \code{"barcodes.tsv.gz"} and \code{"features.tsv.gz"}.
41
41
#'
42
-
#' For \code{type="HDF5"}, a HDF5 file is produced at \code{path} containing data in column-sparse format.
42
+
#' For \code{type="hdf5"}, a HDF5 file is produced at \code{path} containing data in column-sparse format.
43
43
#' If \code{version="2"}, data are stored in the HDF5 group named \code{genome}.
44
44
#' If \code{version="3"}, data are stored in the group \code{"matrix"}.
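
The format auto-detection described in the `@details` text of this diff can be sketched in a few lines of base R. This is an illustration of the documented rule only, not the package's internal code, and the helper name `guess_10x_type` is hypothetical: a path ending in `".h5"` is treated as an HDF5 file, and anything else is assumed to be a directory of Matrix Market and TSV files.

```r
# Hypothetical helper mirroring the documented rule: paths ending in ".h5"
# map to type "hdf5"; every other path is assumed to be an "mtx" directory.
guess_10x_type <- function(path) {
    ifelse(grepl("\\.h5$", path), "hdf5", "mtx")
}

# Vectorised over paths, so it can be applied to several samples at once.
types <- guess_10x_type(c("counts.h5", "filtered_feature_bc_matrix"))
```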