TODOs in the actual GWAS script:
- Regress on complete cases only
- This is not explicit in the Neale Lab code, but it is implicit in the linreg3 call at https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/7_run_linreg3.py#L63 and https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L92
- From the linreg3 docs: "linreg3() uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined."
- Add interaction terms in covariates
- These were added as they are in https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L71
- Determine how many PCs to use
- 20 are used in https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L71
- The Details and Considerations post (which pertains to the earlier results from the code in imputed-v2-gwas) states: "To simplify the process of association testing, association for all phenotypes used a least-squares linear model predicting the phenotype with an additive genotype coding (0, 1, or 2 copies of the minor allele), with sex and the first 10 principal components from the UK Biobank sample QC file as covariates."
- Conclusion: 20 PCs will be used
- Determine which sex + age fields we should be using (is it genetic sex or one of the others?)
- Age is set here https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/00.load_sample_qc_kt.py#L61 as field 21022 (age at recruitment).
- It is not entirely clear where sex comes from in https://github.com/Nealelab/UK_Biobank_GWAS/blob/67289386a851a213f7bb470a3f0f6af95933b041/0.1/00.load_sample_qc_kt.py#L14, but presumably it is the inferred genetic sex field 22001
- Change input to intermediate results following "Store post-QC bgen zarr archives for analysis" (#9)
- With https://github.com/pystatgen/sgkit/pull/391 in, add beta values to results and make sure the full computation does not run twice, once for the betas and once for the p-values
- Decide whether or not to attempt sample QC filter for "Use 7 standard deviations away from the 1st 6 PCs"
- Add separate MAF threshold for coding variants
- Add tolerance to argmax used in hard call logic (make ties NA)
- Group phenotypes to regress together (see https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/4_build_pipelines.py#L27)
- PHESANT output columns might look like 6153_1, 6153_2, 6153_3, ..., 6153_100, 4620 so it is possible to regress some of them together (these subgroupings correspond to one-hot encodings of the data coding for that UKB field)
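The grouping idea in the last item can be sketched as follows. This is only an illustration of grouping by the UKB field id prefix; the helper name `group_phesant_columns` is hypothetical, and the Neale Lab pipeline linked above may group phenotypes differently.

```python
from collections import defaultdict


def group_phesant_columns(columns):
    """Group PHESANT-style column names by their UKB field id prefix.

    Hypothetical helper: assumes names look like "<field>_<code>" or "<field>".
    """
    groups = defaultdict(list)
    for name in columns:
        # "6153_100" -> field "6153"; "4620" has no suffix and forms its own group
        field_id = name.split("_", 1)[0]
        groups[field_id].append(name)
    return dict(groups)


# Example column names taken from the item above
columns = ["6153_1", "6153_2", "6153_3", "6153_100", "4620"]
groups = group_phesant_columns(columns)
```

Because regression is on complete cases, whether the columns in a group actually share the same set of defined samples would still need to be checked before regressing them together.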
Note: for details on the Neale Lab repo organization see Nealelab/UK_Biobank_GWAS#36.
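The first three items (complete cases, interaction terms, 20 PCs) can be sketched together in pandas. This is a minimal sketch under stated assumptions, not the script's actual code: the column names (`age`, `sex`, `PC1`..`PC20`) and the function `build_design` are hypothetical, and the specific interaction terms (age squared, age x sex, age squared x sex) are an assumption about what the linked `22.run_regressions.py` line does.

```python
import numpy as np
import pandas as pd


def build_design(df, phenotypes, n_pcs=20):
    """Return (phenotypes, covariates) restricted to complete cases.

    Hypothetical sketch: column names and interaction terms are assumptions.
    """
    pc_cols = [f"PC{i}" for i in range(1, n_pcs + 1)]
    base = ["age", "sex"] + pc_cols
    # Complete cases: every phenotype and every covariate must be defined
    complete = df[phenotypes + base].notna().all(axis=1)
    sub = df.loc[complete]
    cov = sub[base].copy()
    # Assumed interaction terms
    cov["age_squared"] = cov["age"] ** 2
    cov["age_sex"] = cov["age"] * cov["sex"]
    cov["age_squared_sex"] = cov["age_squared"] * cov["sex"]
    return sub[phenotypes], cov


# Small synthetic example (random data, for illustration only)
rng = np.random.default_rng(0)
n = 8
data = {
    "age": rng.integers(40, 70, n).astype(float),
    "sex": rng.integers(0, 2, n).astype(float),
}
data.update({f"PC{i}": rng.normal(size=n) for i in range(1, 21)})
data.update({"pheno_a": rng.normal(size=n), "pheno_b": rng.normal(size=n)})
df = pd.DataFrame(data)
df.loc[1, "pheno_a"] = np.nan  # missing phenotype -> sample dropped
df.loc[4, "PC3"] = np.nan      # missing covariate -> sample dropped

Y, X = build_design(df, ["pheno_a", "pheno_b"])
```

Filtering once up front mirrors the linreg3 behaviour quoted above, where the same set of samples is used for every phenotype in the group.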