Skip to content

Commit 407cf2a

Browse files
authored
Improve the stability of the knee point identification. (#117)
The new algorithm is based on maximizing the distance from a line between the plateau and the inflection point, which avoids problems with the instability of the empirical second derivative, even with smoothing. Also did a minor fix to action to avoid redundant runs on PR.
1 parent f8b7ceb commit 407cf2a

File tree

7 files changed

+32
-61
lines changed

7 files changed

+32
-61
lines changed

.github/workflows/check-bioc.yml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,8 +22,8 @@
2222

2323
on:
2424
push:
25-
branch:
26-
- 'devel'
25+
branches:
26+
- devel
2727
pull_request:
2828

2929
name: R-CMD-check-bioc

DESCRIPTION

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Package: DropletUtils
2-
Version: 1.27.0
3-
Date: 2024-07-23
2+
Version: 1.27.1
3+
Date: 2024-12-04
44
Title: Utilities for Handling Single-Cell Droplet Data
55
Authors@R: c(
66
person("Aaron", "Lun", role = "aut"),

NAMESPACE

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,22 +118,19 @@ importFrom(scuttle,isOutlier)
118118
importFrom(scuttle,librarySizeFactors)
119119
importFrom(scuttle,sumCountsAcrossCells)
120120
importFrom(scuttle,sumCountsAcrossFeatures)
121-
importFrom(stats,fitted)
122121
importFrom(stats,kmeans)
123122
importFrom(stats,mad)
124123
importFrom(stats,median)
125124
importFrom(stats,optimize)
126125
importFrom(stats,p.adjust)
127126
importFrom(stats,pnbinom)
128127
importFrom(stats,ppois)
129-
importFrom(stats,predict)
130128
importFrom(stats,qnbinom)
131129
importFrom(stats,quantile)
132130
importFrom(stats,rexp)
133131
importFrom(stats,rgamma)
134132
importFrom(stats,rpois)
135133
importFrom(stats,runif)
136-
importFrom(stats,smooth.spline)
137134
importFrom(utils,head)
138135
importFrom(utils,read.delim)
139136
importFrom(utils,tail)

R/barcodeRanks.R

Lines changed: 16 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@
1111
#' @param exclude.from An integer scalar specifying the number of highest ranking barcodes to exclude from spline fitting.
1212
#' Ignored if \code{fit.bounds} is specified.
1313
#' @param assay.type Integer or string specifying the assay containing the count matrix.
14-
#' @param df Integer scalar specifying the number of degrees of freedom, to pass to \code{\link{smooth.spline}}.
14+
#' @param df Deprecated and ignored.
1515
#' @param ... For the generic, further arguments to pass to individual methods.
1616
#'
1717
#' For the SummarizedExperiment method, further arguments to pass to the ANY method.
@@ -34,20 +34,11 @@
3434
#' \item The inflection point is computed as the point on the rank/total curve where the first derivative is minimized.
3535
#' The derivative is computed directly from all points on the curve with total counts greater than \code{lower}.
3636
#' This avoids issues with erratic behaviour of the curve at lower totals.
37-
#' \item The knee point is defined as the point on the curve where the signed curvature is minimized.
38-
#' This requires calculation of the second derivative, which is much more sensitive to noise in the curve.
39-
#' To overcome this, a smooth spline is fitted to the log-total counts against the log-rank using \code{\link{smooth.spline}}.
40-
#' Derivatives are then calculated from the fitted spline using \code{\link{predict}}.
37+
#' \item The knee point is defined as the point on the curve that is furthest from the straight line drawn between the \code{fit.bounds} locations on the curve.
38+
#' We used to minimize the signed curvature to identify the knee point but this relies on the second derivative,
39+
#' which was too unstable even after smoothing.
4140
#' }
4241
#'
43-
#' @section Details on curve fitting:
44-
#' We supply a relatively low default setting of \code{df} to avoid overfitting the spline,
45-
#' as this results in unstability in the higher derivatives (and thus the curvature).
46-
#' \code{df} and other arguments to \code{\link{smooth.spline}} can be tuned
47-
#' if the estimated knee point is not at an appropriate location.
48-
#' We also restrict the fit to lie within the bounds defined by \code{fit.bounds} to focus on the region containing the knee point.
49-
#' This allows us to obtain an accurate fit with low \code{df} rather than attempting to model the entire curve.
50-
#'
5142
#' If \code{fit.bounds} is not specified, the lower bound is automatically set to the inflection point
5243
#' as this should lie below the knee point on typical curves.
5344
#' The upper bound is set to the point at which the first derivative is closest to zero,
@@ -63,8 +54,6 @@
6354
#' \describe{
6455
#' \item{\code{rank}:}{Numeric, the rank of each barcode (averaged across ties).}
6556
#' \item{\code{total}:}{Numeric, the total counts for each barcode.}
66-
#' \item{\code{fitted}:}{Numeric, the fitted value from the spline for each barcode.
67-
#' This is \code{NA} for points with \code{x} outside of \code{fit.bounds}.}
6857
#' }
6958
#'
7059
#' The metadata contains \code{knee}, a numeric scalar containing the total count at the knee point;
@@ -85,7 +74,6 @@
8574
#' # Making a plot.
8675
#' plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
8776
#' o <- order(br.out$rank)
88-
#' lines(br.out$rank[o], br.out$fitted[o], col="red")
8977
#' abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
9078
#' abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
9179
#' legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"),
@@ -98,7 +86,6 @@
9886
#' @name barcodeRanks
9987
NULL
10088

101-
#' @importFrom stats smooth.spline predict fitted
10289
#' @importFrom utils tail
10390
#' @importFrom Matrix colSums
10491
#' @importFrom S4Vectors DataFrame metadata<-
@@ -134,21 +121,20 @@ NULL
134121
if (is.null(fit.bounds)) {
135122
new.keep <- left.edge:right.edge
136123
} else {
137-
new.keep <- y > log10(fit.bounds[1]) & y < log10(fit.bounds[2])
124+
new.keep <- which(y > log10(fit.bounds[1]) & y < log10(fit.bounds[2]))
138125
}
139126

140-
# Smoothing to avoid error multiplication upon differentiation.
141-
# Minimizing the signed curvature and returning the total for the knee point.
142-
fitted.vals <- rep(NA_real_, length(keep))
143-
127+
# Using the maximum distance to identify the knee point.
144128
if (length(new.keep) >= 4) {
145-
fit <- smooth.spline(x[new.keep], y[new.keep], df=df, ...)
146-
fitted.vals[keep][new.keep] <- 10^fitted(fit)
147-
148-
d1 <- predict(fit, deriv=1)$y
149-
d2 <- predict(fit, deriv=2)$y
150-
curvature <- d2/(1 + d1^2)^1.5
151-
knee <- 10^(y[new.keep][which.min(curvature)])
129+
curx <- x[new.keep]
130+
cury <- y[new.keep]
131+
xbounds <- curx[c(1L, length(new.keep))]
132+
ybounds <- cury[c(1L, length(new.keep))]
133+
gradient <- diff(ybounds)/diff(xbounds)
134+
intercept <- ybounds[1] - xbounds[1] * gradient
135+
above <- which(cury >= curx * gradient + intercept)
136+
dist <- abs(gradient * curx[above] - cury[above] + intercept)/sqrt(gradient^2 + 1)
137+
knee <- 10^(cury[above[which.max(dist)]])
152138
} else {
153139
# Sane fallback upon overly aggressive filtering by 'exclude.from', 'lower'.
154140
knee <- 10^(y[new.keep[1]])
@@ -157,8 +143,7 @@ NULL
157143
# Returning a whole stack of useful stats.
158144
out <- DataFrame(
159145
rank=.reorder(run.rank, stuff$lengths, o),
160-
total=.reorder(run.totals, stuff$lengths, o),
161-
fitted=.reorder(fitted.vals, stuff$lengths, o)
146+
total=.reorder(run.totals, stuff$lengths, o)
162147
)
163148
rownames(out) <- colnames(m)
164149
metadata(out) <- list(knee=knee, inflection=inflection)

inst/NEWS.Rd

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,12 @@
22
\title{DropletUtils News}
33
\encoding{UTF-8}
44

5+
\section{Version 1.28.0}{\itemize{
6+
\item Use a more stable algorithm for identifying the knee point in \code{barcodeRanks()}.
7+
The new algorithm is based on maximizing the distance from a line between the plateau and the inflection point.
8+
Previously, we tried to minimize the signed curvature but this was susceptible to many local minima due to the instability of the empirical second derivative, even after smoothing.
9+
}}
10+
511
\section{Version 1.18.0}{\itemize{
612
\item Added an \code{intersect.genes=} option to \code{read10xCounts()} for samples with inconsistent gene information.
713
Automatically fix empty chromosome names for mitochondrial genes in certain Cellranger outputs.

man/barcodeRanks.Rd

Lines changed: 4 additions & 18 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

tests/testthat/test-misc.R

Lines changed: 2 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,6 @@ test_that("barcodeRanks runs to completion", {
1111
brout <- barcodeRanks(my.counts, lower=limit)
1212
expect_equal(brout$total, totals)
1313
expect_identical(brout$rank, rank(-totals, ties.method="average"))
14-
expect_true(all(is.na(brout$fitted[totals <= limit])))
1514

1615
# Trying again with a higher limit.
1716
limit2 <- 200
@@ -21,9 +20,8 @@ test_that("barcodeRanks runs to completion", {
2120
# Specifying the boundaries.
2221
bounds <- c(200, 1000)
2322
brout3 <- barcodeRanks(my.counts, lower=limit, fit.bounds=bounds)
24-
is.okay <- totals > bounds[1] & totals < bounds[2]
25-
expect_true(all(is.na(brout3$fitted[!is.okay])))
26-
expect_true(all(!is.na(brout3$fitted[is.okay])))
23+
knee <- metadata(brout3)$knee
24+
expect_true(knee >= bounds[1] && knee <= bounds[2])
2725

2826
# Respecting column names.
2927
alt <- my.counts
@@ -32,7 +30,6 @@ test_that("barcodeRanks runs to completion", {
3230
expect_identical(rownames(brout2), colnames(alt))
3331
expect_identical(names(brout2$rank), NULL)
3432
expect_identical(names(brout2$total), NULL)
35-
expect_identical(names(brout2$fitted), NULL)
3633

3734
# Trying out silly inputs.
3835
expect_error(barcodeRanks(my.counts[,0]), "insufficient")

0 commit comments

Comments
 (0)