Skip to content

[r] add iterative lsi design docs #167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: design-docs
Choose a base branch
from

Conversation

immanuelazn
Copy link
Collaborator

@immanuelazn immanuelazn commented Dec 12, 2024

Pretty much as we discussed during call!

I think the biggest point of contention is the normalization structure. Normalizations like tf-idf, Z-score norms can have data that they will be fit to. However, the biggest problem is that we want something that can interoperate with BPCells operations, while also return the calculated information (mean, variance, idf). Should it follow the same styling as the S3 class for LSI that we are creating, with cell.embeddings/feature.loadings?

I propose just having a boolean param, with the default returning an IterableMatrix, and the other being an option to return a class that we can project with.

@immanuelazn immanuelazn requested a review from bnprks December 12, 2024 00:11
@bnprks
Copy link
Owner

bnprks commented Dec 12, 2024

Thanks for this update, very clearly written

Discussion points:

  • New idea for the base name, maybe select_features_[method] instead of variable_features_[method]? I think this might side-step some confusion that some selection methods won't be based on variance, and has the benefit of actually using a verb. Curious what you think

Comments:

  • I think mean might be a better general name than _by_top_accessibility -- I think the calculation would be basically the same
  • for _by_cluster_variance, I thought we discussed that we might not make this helper method layer, and just do the pseudobulking in the iterative lsi function
    • That said, we could probably use functions for _variance and _dispersion (variance divided by mean)
  • As discussed on slack, I think normalize_tfidf should have a feature_means or feature_sums argument. We won't return any extra data and rely on the clarity of the argument name for pepole to know
  • I think we can ditch feature.loadings for now from the Seurat interface, and maybe rename cell.embeddings to cell_embeddings for better stylistic consistency within BPCells. We can add feature.loadings back in later if we decide it would be useful

@immanuelazn
Copy link
Collaborator Author

immanuelazn commented Dec 12, 2024

select_features is good!

For your second point comment point, I think my interpretation from that conversation was that the clustering would be separate, but we would still pseudobulk within funciton. By all means, we can take that out and put it into the wrapper iterative lsi function though! Do you think it would make sense to do variance and dispersion as just a parameter in the same function instead, rather than separate functions?

Overall I agree with your other points. Will reflect here soon!

@bnprks
Copy link
Owner

bnprks commented Dec 12, 2024

I was originally thinking clustering + pseudobulk calculation could happen in iterative_lsi. Then we still have a parameter in iterative_lsi to let the user configure how feature selection happens (which could be e.g. select_features_variance(normalize=normalize_log) by default). Then the two functions I mentioned would just be general options for variable feature selection, implemented outside of iterative lsi. I think having at least a select_features_variance function separate is a good idea; still OK to have a select_features_by_pseudobulk method if you prefer to implement it that way -- not a big deal either way

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants