GWAS script tasks

TODOs in the actual GWAS script:

- [x] Regress on complete cases only
    - This is not explicit in the Neale Lab code but it is implicit in the `linreg3` call at https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/7_run_linreg3.py#L63 and https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L92
    - From the [linreg3 docs](https://hail.is/docs/0.1/hail.VariantDataset.html#hail.VariantDataset.linreg3):
        > linreg3() uses the same set of samples for each phenotype, namely the set of samples for which all phenotypes and covariates are defined.
- [x] Add interaction terms in covariates
    - These were added as they are in https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L71
- [x] Determine how many PCs to use
    - 20 are used in https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/22.run_regressions.py#L71
    - The [Details and Considerations](http://www.nealelab.is/blog/2017/9/11/details-and-considerations-of-the-uk-biobank-gwas) post (which pertains to the earlier results from the code in `imputed-v2-gwas`) states:
        > To simplify the process of association testing, association for all phenotypes used a least-squares linear model predicting the phenotype with an additive genotype coding (0, 1, or 2 copies of the minor allele), with sex and the first 10 principal components from the UK Biobank sample QC file as covariates.
    - **Conclusion**: 20 PCs will be used
- [x] Determine which sex + age fields we should be using (is it genetic sex or one of the others?)
    - Age is set at https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/0.1/00.load_sample_qc_kt.py#L61 as field [21022](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21022) (age at recruitment).
    - It is not entirely clear where sex comes from in https://github.com/Nealelab/UK_Biobank_GWAS/blob/67289386a851a213f7bb470a3f0f6af95933b041/0.1/00.load_sample_qc_kt.py#L14, but presumably it is the inferred genetic sex field [22001](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22001).
- [x] Change input to intermediate results following https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/9
- [x] Now that https://github.com/pystatgen/sgkit/pull/391 is in, add beta values to the results and make sure the full computation does not run twice (once for the betas and once for the p-values)
- [ ] Decide whether or not to attempt the sample QC filter "Use 7 standard deviations away from the 1st 6 PCs"
- [ ] Add separate MAF threshold for coding variants
- [ ] Add tolerance to argmax used in hard call logic (make ties NA)
    - Waiting on https://github.com/pystatgen/sgkit/pull/419
- [ ] Group phenotypes to regress together (see https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/imputed-v2-gwas/4_build_pipelines.py#L27)
    - PHESANT output columns might look like 6153_1, 6153_2, 6153_3, ..., 6153_100, 4620 so it is possible to regress some of them together (these subgroupings correspond to one-hot encodings of the data coding for that UKB field)
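
As a rough illustration of the phenotype grouping item above, PHESANT-style column names could be bucketed by the UKB field ID that precedes the underscore. This is only a sketch: `group_phesant_columns` is a hypothetical helper, and whether the Neale Lab `4_build_pipelines.py` script groups columns exactly this way has not been checked here.

```python
from collections import defaultdict


def group_phesant_columns(columns):
    """Group PHESANT-style column names by their UKB field ID prefix.

    Hypothetical helper: assumes names look like "<field>_<code>" for
    one-hot encoded fields and "<field>" for everything else.
    """
    groups = defaultdict(list)
    for name in columns:
        # "6153_1" -> field "6153"; "4620" has no suffix and forms its own group
        field_id = name.split("_", 1)[0]
        groups[field_id].append(name)
    return dict(groups)


columns = ["6153_1", "6153_2", "6153_3", "6153_100", "4620"]
groups = group_phesant_columns(columns)
for field_id, members in groups.items():
    print(field_id, members)
```

Each resulting group could then be passed to a single regression call so that the columns derived from one field share the same set of complete-case samples.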

Note: for details on the Neale Lab repo organization see https://github.com/Nealelab/UK_Biobank_GWAS/issues/36.
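
For the hard call item in the list above ("add tolerance to argmax ... make ties NA"), one possible shape of the logic is sketched below with plain NumPy. This is not the sgkit implementation and does not reflect what the pending sgkit PR does; the function name, the `tol` default, and the use of `-1` as the missing-call sentinel are all assumptions made for illustration.

```python
import numpy as np


def hard_calls_with_tolerance(probs, tol=1e-6):
    """Convert genotype probabilities to hard calls, marking ties as missing.

    probs: array of shape (..., n_genotypes).
    Returns an integer array of shape (...), with -1 (assumed sentinel)
    wherever two or more probabilities are within `tol` of the maximum
    or any probability is NaN.
    """
    probs = np.asarray(probs, dtype=float)
    best = probs.max(axis=-1, keepdims=True)
    # Number of genotypes whose probability is within tol of the maximum
    n_near_best = (probs >= best - tol).sum(axis=-1)
    calls = probs.argmax(axis=-1)
    has_nan = np.isnan(probs).any(axis=-1)
    return np.where((n_near_best > 1) | has_nan, -1, calls)


probs = np.array(
    [
        [0.9, 0.1, 0.0],  # clear call: 0
        [0.5, 0.5, 0.0],  # tie between 0 and 1: missing
        [0.2, 0.3, 0.5],  # clear call: 2
        [0.4, 0.4, 0.2],  # tie between 0 and 1: missing
    ]
)
calls = hard_calls_with_tolerance(probs)
print(calls)
```

Without the tolerance check, a plain `argmax` would silently resolve each tie in favor of the lowest genotype index.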
