Skip to content

Improve the stability of the knee point identification. #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .github/workflows/check-bioc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@

on:
push:
branch:
- 'devel'
branches:
- devel
pull_request:

name: R-CMD-check-bioc
Expand Down
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: DropletUtils
Version: 1.27.0
Date: 2024-07-23
Version: 1.27.1
Date: 2024-12-04
Title: Utilities for Handling Single-Cell Droplet Data
Authors@R: c(
person("Aaron", "Lun", role = "aut"),
Expand Down
3 changes: 0 additions & 3 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -118,22 +118,19 @@ importFrom(scuttle,isOutlier)
importFrom(scuttle,librarySizeFactors)
importFrom(scuttle,sumCountsAcrossCells)
importFrom(scuttle,sumCountsAcrossFeatures)
importFrom(stats,fitted)
importFrom(stats,kmeans)
importFrom(stats,mad)
importFrom(stats,median)
importFrom(stats,optimize)
importFrom(stats,p.adjust)
importFrom(stats,pnbinom)
importFrom(stats,ppois)
importFrom(stats,predict)
importFrom(stats,qnbinom)
importFrom(stats,quantile)
importFrom(stats,rexp)
importFrom(stats,rgamma)
importFrom(stats,rpois)
importFrom(stats,runif)
importFrom(stats,smooth.spline)
importFrom(utils,head)
importFrom(utils,read.delim)
importFrom(utils,tail)
Expand Down
47 changes: 16 additions & 31 deletions R/barcodeRanks.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
#' @param exclude.from An integer scalar specifying the number of highest ranking barcodes to exclude from spline fitting.
#' Ignored if \code{fit.bounds} is specified.
#' @param assay.type Integer or string specifying the assay containing the count matrix.
#' @param df Integer scalar specifying the number of degrees of freedom, to pass to \code{\link{smooth.spline}}.
#' @param df Deprecated and ignored.
#' @param ... For the generic, further arguments to pass to individual methods.
#'
#' For the SummarizedExperiment method, further arguments to pass to the ANY method.
Expand All @@ -34,20 +34,11 @@
#' \item The inflection point is computed as the point on the rank/total curve where the first derivative is minimized.
#' The derivative is computed directly from all points on the curve with total counts greater than \code{lower}.
#' This avoids issues with erratic behaviour of the curve at lower totals.
#' \item The knee point is defined as the point on the curve where the signed curvature is minimized.
#' This requires calculation of the second derivative, which is much more sensitive to noise in the curve.
#' To overcome this, a smooth spline is fitted to the log-total counts against the log-rank using \code{\link{smooth.spline}}.
#' Derivatives are then calculated from the fitted spline using \code{\link{predict}}.
#' \item The knee point is defined as the point on the curve that is furthest from the straight line drawn between the \code{fit.bounds} locations on the curve.
#' We used to minimize the signed curvature to identify the knee point but this relies on the second derivative,
#' which was too unstable even after smoothing.
#' }
#'
#' @section Details on curve fitting:
#' We supply a relatively low default setting of \code{df} to avoid overfitting the spline,
#' as this results in unstability in the higher derivatives (and thus the curvature).
#' \code{df} and other arguments to \code{\link{smooth.spline}} can be tuned
#' if the estimated knee point is not at an appropriate location.
#' We also restrict the fit to lie within the bounds defined by \code{fit.bounds} to focus on the region containing the knee point.
#' This allows us to obtain an accurate fit with low \code{df} rather than attempting to model the entire curve.
#'
#' If \code{fit.bounds} is not specified, the lower bound is automatically set to the inflection point
#' as this should lie below the knee point on typical curves.
#' The upper bound is set to the point at which the first derivative is closest to zero,
Expand All @@ -63,8 +54,6 @@
#' \describe{
#' \item{\code{rank}:}{Numeric, the rank of each barcode (averaged across ties).}
#' \item{\code{total}:}{Numeric, the total counts for each barcode.}
#' \item{\code{fitted}:}{Numeric, the fitted value from the spline for each barcode.
#' This is \code{NA} for points with \code{x} outside of \code{fit.bounds}.}
#' }
#'
#' The metadata contains \code{knee}, a numeric scalar containing the total count at the knee point;
Expand All @@ -85,7 +74,6 @@
#' # Making a plot.
#' plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
#' o <- order(br.out$rank)
#' lines(br.out$rank[o], br.out$fitted[o], col="red")
#' abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
#' abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
#' legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"),
Expand All @@ -98,7 +86,6 @@
#' @name barcodeRanks
NULL

#' @importFrom stats smooth.spline predict fitted
#' @importFrom utils tail
#' @importFrom Matrix colSums
#' @importFrom S4Vectors DataFrame metadata<-
Expand Down Expand Up @@ -134,21 +121,20 @@ NULL
if (is.null(fit.bounds)) {
new.keep <- left.edge:right.edge
} else {
new.keep <- y > log10(fit.bounds[1]) & y < log10(fit.bounds[2])
new.keep <- which(y > log10(fit.bounds[1]) & y < log10(fit.bounds[2]))
}

# Smoothing to avoid error multiplication upon differentiation.
# Minimizing the signed curvature and returning the total for the knee point.
fitted.vals <- rep(NA_real_, length(keep))

# Using the maximum distance to identify the knee point.
if (length(new.keep) >= 4) {
fit <- smooth.spline(x[new.keep], y[new.keep], df=df, ...)
fitted.vals[keep][new.keep] <- 10^fitted(fit)

d1 <- predict(fit, deriv=1)$y
d2 <- predict(fit, deriv=2)$y
curvature <- d2/(1 + d1^2)^1.5
knee <- 10^(y[new.keep][which.min(curvature)])
curx <- x[new.keep]
cury <- y[new.keep]
xbounds <- curx[c(1L, length(new.keep))]
ybounds <- cury[c(1L, length(new.keep))]
gradient <- diff(ybounds)/diff(xbounds)
intercept <- ybounds[1] - xbounds[1] * gradient
above <- which(cury >= curx * gradient + intercept)
dist <- abs(gradient * curx[above] - cury[above] + intercept)/sqrt(gradient^2 + 1)
knee <- 10^(cury[above[which.max(dist)]])
} else {
# Sane fallback upon overly aggressive filtering by 'exclude.from', 'lower'.
knee <- 10^(y[new.keep[1]])
Expand All @@ -157,8 +143,7 @@ NULL
# Returning a whole stack of useful stats.
out <- DataFrame(
rank=.reorder(run.rank, stuff$lengths, o),
total=.reorder(run.totals, stuff$lengths, o),
fitted=.reorder(fitted.vals, stuff$lengths, o)
total=.reorder(run.totals, stuff$lengths, o)
)
rownames(out) <- colnames(m)
metadata(out) <- list(knee=knee, inflection=inflection)
Expand Down
6 changes: 6 additions & 0 deletions inst/NEWS.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,12 @@
\title{DropletUtils News}
\encoding{UTF-8}

\section{Version 1.28.0}{\itemize{
\item Use a more stable algorithm for identifying the knee point in \code{barcodeRanks()}.
The new algorithm is based on maximizing the distance from a line between the plateau and the inflection point.
Previously, we tried to minimize the signed curvature but this was susceptible to many local minima due to the instability of the empirical second derivative, even after smoothing.
}}

\section{Version 1.18.0}{\itemize{
\item Added an \code{intersect.genes=} option to \code{read10xCounts()} for samples with inconsistent gene information.
Automatically fix empty chromosome names for mitochondrial genes in certain Cellranger outputs.
Expand Down
22 changes: 4 additions & 18 deletions man/barcodeRanks.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 2 additions & 5 deletions tests/testthat/test-misc.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,6 @@ test_that("barcodeRanks runs to completion", {
brout <- barcodeRanks(my.counts, lower=limit)
expect_equal(brout$total, totals)
expect_identical(brout$rank, rank(-totals, ties.method="average"))
expect_true(all(is.na(brout$fitted[totals <= limit])))

# Trying again with a higher limit.
limit2 <- 200
Expand All @@ -21,9 +20,8 @@ test_that("barcodeRanks runs to completion", {
# Specifying the boundaries.
bounds <- c(200, 1000)
brout3 <- barcodeRanks(my.counts, lower=limit, fit.bounds=bounds)
is.okay <- totals > bounds[1] & totals < bounds[2]
expect_true(all(is.na(brout3$fitted[!is.okay])))
expect_true(all(!is.na(brout3$fitted[is.okay])))
knee <- metadata(brout3)$knee
expect_true(knee >= bounds[1] && knee <= bounds[2])

# Respecting column names.
alt <- my.counts
Expand All @@ -32,7 +30,6 @@ test_that("barcodeRanks runs to completion", {
expect_identical(rownames(brout2), colnames(alt))
expect_identical(names(brout2$rank), NULL)
expect_identical(names(brout2$total), NULL)
expect_identical(names(brout2$fitted), NULL)

# Trying out silly inputs.
expect_error(barcodeRanks(my.counts[,0]), "insufficient")
Expand Down
Loading