-
Notifications
You must be signed in to change notification settings - Fork 28
Set alpha=Inf as the default for testEmptyDrops. #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is because a finite alpha is not universally safer; despite representing an overdispersed multinomial, using it to compute p-values for a multinomial-distributed vector will actually be anticonservative! This means that an inaccurate alpha cannot be brushed under the carpet. Previously, I was trusting that any finite alpha would be safer than alpha=Inf, but it seems that this is not true. We can't just ignore poor estimates of alpha that we get from the assumed-ambient counts. So, we just keep it simple and default alpha=Inf, which is exactly correct for the multinomial case (which should hopefully be most cases). This also reduces runtime and aligns with the CellRanger defaults.
To demonstrate: alpha <- 2000 # assumed "incorrect" alpha
nbarcodes <- 1000
ngenes <- 10000
# Simulating multinomial draws and computing "incorrect" log-likelihoods under a Dirichlet multinomial.
total <- 800
ambient <- runif(ngenes)
ambient <- ambient / sum(ambient)
am.alpha <- ambient * alpha
pm <- numeric(1000)
for (it in seq_along(pm)) {
sim <- rmultinom(1, total, prob=ambient)
# Only considering the data-dependent part of the PMF, for simplicity.
pm[it] <- sum(lgamma(sim + am.alpha) - lfactorial(sim) - lgamma(am.alpha))
}
# Performing Monte-Carlo draws from a Dirichlet-multinomial and computing their log-likelihoods.
sim.probs <- numeric(1000)
for (it in seq_along(sim.probs)) {
sim.p <- rgamma(ngenes, shape=am.alpha, rate=1)
sim.x <- rmultinom(1, total, prob=sim.p)
sim.probs[it] <- sum(lgamma(sim.x + am.alpha) - lfactorial(sim.x) - lgamma(am.alpha))
}
boxplot(list(Observed=pm, Expected=sim.probs), ylab="Dirichlet-multinomial log-likelihood") So we can see that the expected log-likelihoods are always higher than the observed ones, resulting in Monte-Carlo p-values of zero that are obviously anticonservative. This is unexpected as overstating the variance usually results in higher p-values when testing for deviation, which is why I thought that The underlying reason is because the DM at low alpha expects the count vector to be sparse, as probability mass is concentrated in a subset of the entries after the gamma sampling. However, the pure multinomial yields denser count vectors, which is unusual from a DM perspective. This results in a lower log-likelihood as observed above. |
Thanks Aaron @LTLA! Does this mean that In other words, since Thanks! Best, |
There are still a few differences between the algorithms, which are currently described in |
Thanks Aaron! |
- Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets, which causes knock-on changes to the cluster labels
- Details of changes: MarioniLab/DropletUtils#118 - Since the default is now `alpha=Inf`, the claims around `alpha` values and sanity check no longer make sense, so just remove them.
- Details of changes: MarioniLab/DropletUtils#118 - For this chunk, the new default of `alpha=Inf` gives poor results. Some quick comparisons given below - BioC 3.21 - 39,028 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000)` - 21,374 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000, alpha=NULL) - 29,673 non-empty droplets with `emptyDrops(counts(hto.sce), by.rank=30000)` - BioC 3.20 - 21780 on-empty droplets with `emptyDrops(counts(hto.sce), by.rank=40000) (i.e. former default of `alpha = NULL`) - Added some explanation/justification cribbed from that removed in 76a88c5
- Knock-on effect of changes to `tenx-unfiltered-pbmc4k.Rmd` in **OSCA.workflows** (OSCA-source/OSCA.workflows@6ab1fc4) - Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets which causes knock-on changes to the cluster labels
- Knock-on effect of changes to `tenx-unfiltered-pbmc4k.Rmd` in **OSCA.workflows** (OSCA-source/OSCA.workflows@6ab1fc4) - Details of changes: MarioniLab/DropletUtils#118 - Results in slightly different set of non-empty droplets which causes knock-on changes to the cluster label
This is because a finite alpha is not universally safer; despite representing an overdispersed multinomial, using it to compute p-values for a multinomial-distributed vector will actually be anticonservative!
This means that an inaccurate alpha cannot be brushed under the carpet. Previously, I was trusting that any finite alpha would be safer than alpha=Inf, but it seems that this is not true. We can't just ignore poor estimates of alpha that we get from the assumed-ambient counts.
So, we just keep it simple and default alpha=Inf, which is exactly correct for the multinomial case (which should hopefully handle most use cases). This also reduces runtime and aligns with the CellRanger defaults.