Set alpha=Inf as the default for testEmptyDrops. #118

LTLA · 2024-12-10T22:14:21Z

This is because a finite alpha is not universally safer; despite representing an overdispersed multinomial, using it to compute p-values for a multinomial-distributed vector will actually be anticonservative!

This means that an inaccurate alpha cannot be brushed under the carpet. Previously, I was trusting that any finite alpha would be safer than alpha=Inf, but it seems that this is not true. We can't just ignore poor estimates of alpha that we get from the assumed-ambient counts.

So, we just keep it simple and default alpha=Inf, which is exactly correct for the multinomial case (which should hopefully handle most use cases). This also reduces runtime and aligns with the CellRanger defaults.

This is because a finite alpha is not universally safer; despite representing an overdispersed multinomial, using it to compute p-values for a multinomial-distributed vector will actually be anticonservative! This means that an inaccurate alpha cannot be brushed under the carpet. Previously, I was trusting that any finite alpha would be safer than alpha=Inf, but it seems that this is not true. We can't just ignore poor estimates of alpha that we get from the assumed-ambient counts. So, we just keep it simple and default alpha=Inf, which is exactly correct for the multinomial case (which should hopefully be most cases). This also reduces runtime and aligns with the CellRanger defaults.

LTLA · 2024-12-10T22:46:04Z

To demonstrate:

alpha <- 2000 # assumed "incorrect" alpha
nbarcodes <- 1000
ngenes <- 10000

# Simulating multinomial draws and computing "incorrect" log-likelihoods under a Dirichlet multinomial.
total <- 800
ambient <- runif(ngenes)
ambient <- ambient / sum(ambient)

am.alpha <- ambient * alpha
pm <- numeric(1000)
for (it in seq_along(pm)) {
    sim <- rmultinom(1, total, prob=ambient)
    # Only considering the data-dependent part of the PMF, for simplicity.
    pm[it] <- sum(lgamma(sim + am.alpha) - lfactorial(sim) - lgamma(am.alpha)) 
}

# Performing Monte-Carlo draws from a Dirichlet-multinomial and computing their log-likelihoods.
sim.probs <- numeric(1000)
for (it in seq_along(sim.probs)) {
    sim.p <- rgamma(ngenes, shape=am.alpha, rate=1)
    sim.x <- rmultinom(1, total, prob=sim.p)
    sim.probs[it] <- sum(lgamma(sim.x + am.alpha) - lfactorial(sim.x) - lgamma(am.alpha))
}

boxplot(list(Observed=pm, Expected=sim.probs), ylab="Dirichlet-multinomial log-likelihood")

So we can see that the expected log-likelihoods are always higher than the observed ones, resulting in Monte-Carlo p-values of zero that are obviously anticonservative. This is unexpected as overstating the variance usually results in higher p-values when testing for deviation, which is why I thought that alpha < Inf would be generally safe in the first place.

The underlying reason is because the DM at low alpha expects the count vector to be sparse, as probability mass is concentrated in a subset of the entries after the gamma sampling. However, the pure multinomial yields denser count vectors, which is unusual from a DM perspective. This results in a lower log-likelihood as observed above.

lcolladotor · 2024-12-11T17:13:25Z

Thanks Aaron @LTLA!

Does this mean that DropletUtils::emptyDrops() now fully matches the CellRanger parameters for EmptyDrops? Or are there still some differences between the two?

In other words, since CellRanger both OrdMag and EmptyDrops as noted at https://www.10xgenomics.com/support/software/cell-ranger/latest/algorithms-overview/cr-gex-algorithm, is now the only difference the OrdMag part when analyzing the filtered CellRanger outputs versus the raw outputs in R and filtered using DropletUtils::emptyDrops()?

Thanks!

Best,
Leo

LTLA · 2024-12-11T17:20:56Z

There are still a few differences between the algorithms, which are currently described in ?emptyDropsCellRanger. This change eliminates one of the undocumented differences.

lcolladotor · 2024-12-11T18:00:03Z

Thanks Aaron!

- Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets, which causes knock-on changes to the cluster labels

- Details of changes: MarioniLab/DropletUtils#118 - Since the default is now `alpha=Inf`, the claims around `alpha` values and sanity check no longer make sense, so just remove them.

- Details of changes: MarioniLab/DropletUtils#118 - For this chunk, the new default of `alpha=Inf` gives poor results. Some quick comparisons given below - BioC 3.21 - 39,028 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000)` - 21,374 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000, alpha=NULL) - 29,673 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=30000)` - BioC 3.20 - 21780 on-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000) (i.e. former default of `alpha = NULL`) - Added some explanation/justification cribbed from that removed in 76a88c5

- Knock-on effect of changes to `tenx-unfiltered-pbmc4k.Rmd` in **OSCA.workflows** (OSCA-source/OSCA.workflows@6ab1fc4) - Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets which causes knock-on changes to the cluster labels

- Knock-on effect of changes to `tenx-unfiltered-pbmc4k.Rmd` in **OSCA.workflows** (OSCA-source/OSCA.workflows@6ab1fc4) - Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets which causes knock-on changes to the cluster label

LTLA merged commit e33763c into devel Dec 10, 2024
1 check passed

LTLA deleted the alpha_inf branch December 10, 2024 22:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Set alpha=Inf as the default for testEmptyDrops. #118

Set alpha=Inf as the default for testEmptyDrops. #118

Uh oh!

LTLA commented Dec 10, 2024

Uh oh!

LTLA commented Dec 10, 2024 •

edited

Loading

Uh oh!

Uh oh!

lcolladotor commented Dec 11, 2024

Uh oh!

LTLA commented Dec 11, 2024

Uh oh!

lcolladotor commented Dec 11, 2024

Uh oh!

Uh oh!

Set alpha=Inf as the default for testEmptyDrops. #118

Set alpha=Inf as the default for testEmptyDrops. #118

Uh oh!

Conversation

LTLA commented Dec 10, 2024

Uh oh!

LTLA commented Dec 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

lcolladotor commented Dec 11, 2024

Uh oh!

LTLA commented Dec 11, 2024

Uh oh!

lcolladotor commented Dec 11, 2024

Uh oh!

Uh oh!

LTLA commented Dec 10, 2024 •

edited

Loading