
This repo features the code, input and output files for my undergraduate thesis titled "Design and optimization of an integrated computational and machine learning framework to detect functional variants and elucidate their contributions to biological pathways".


NikitaArya17/undergraduate-thesis


Design and optimization of an integrated computational and machine learning framework to detect functional variants and elucidate their contributions to biological pathways

This is my undergraduate thesis, running from August 2025 to May 2026 (currently ongoing) at the Komplex Systems Laboratory, Ahmedabad University, under the supervision of Professor Krishna Swamy. I am designing and optimising a computational framework that integrates the analysis of whole-genome sequencing (WGS) data, including variant calling and Copy Number Variation (CNV) determination, with a machine learning (ML) ensemble. The ensemble comprises a Naïve Bayes (NB) classifier, a Support Vector Machine (SVM) and an Artificial Neural Network (ANN), and is used to identify functional variants and predict their possible roles in a biological pathway.

WGS reads were aligned to the reference genomes using bwa-mem2. After filtering for duplicate reads and low-quality bases, GATK’s HaplotypeCaller, SelectVariants and VariantFiltration tools were used to identify SNPs and indels. Variant annotation was performed with SnpEff and SnpSift. Copy number variation across the genome was determined with Control-FREEC. Each step of the pipeline was run on Stepwell, the university's HPC cluster, which uses Slurm to schedule and run jobs.
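The alignment and variant-calling steps above can be sketched as an ordered list of commands. This is only an illustration, not the job scripts in this repo: the function name, file names, thread count and the `QD < 2.0` filter are all assumptions. Sorting and duplicate filtering are left out because the tools used for them are not named here, so the BAM file is assumed to already exist by the time HaplotypeCaller runs.

```python
def build_variant_calling_commands(reference, reads_1, reads_2, sample, threads=8):
    """Return (step_name, argv) pairs in pipeline order, without running anything.

    All file names are hypothetical. The BAM passed to HaplotypeCaller is assumed
    to be sorted, indexed and duplicate-filtered by steps not shown here.
    """
    bam = f"{sample}.dedup.bam"          # assumed output of the omitted filtering step
    raw_vcf = f"{sample}.raw.vcf.gz"
    snp_vcf = f"{sample}.snps.vcf.gz"
    filtered_vcf = f"{sample}.snps.filtered.vcf.gz"

    return [
        # bwa-mem2 writes SAM to stdout; redirect it in the job script.
        ("align", ["bwa-mem2", "mem", "-t", str(threads), reference, reads_1, reads_2]),
        ("call", ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", raw_vcf]),
        ("select_snps", ["gatk", "SelectVariants", "-R", reference, "-V", raw_vcf,
                         "--select-type-to-include", "SNP", "-O", snp_vcf]),
        # The filter expression and name are illustrative, not the thesis settings.
        ("filter_snps", ["gatk", "VariantFiltration", "-R", reference, "-V", snp_vcf,
                         "--filter-expression", "QD < 2.0", "--filter-name", "lowQD",
                         "-O", filtered_vcf]),
    ]


if __name__ == "__main__":
    for name, argv in build_variant_calling_commands("ref.fa", "r1.fq.gz", "r2.fq.gz", "sample1"):
        print(name, " ".join(argv))
```

Keeping each step as a plain argument list makes it easy to inspect the commands before placing them in a Slurm batch script. Indels would follow the same pattern with `INDEL` as the selected variant type.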

The machine learning pipeline has been built on the code used in this study. The NB Classifier and SVM will be built in R, which has also been used to preprocess the input data and convert it into a format that is suitable for training ML models. Python will be used to build the ANN.
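How the three models' predictions are combined is not stated here. One common option is a simple majority vote, sketched below in plain Python. The function names and the class labels (`"functional"`, `"neutral"`) are hypothetical, and the per-model predictions are assumed to have been exported from R and Python as equal-length lists.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when the top labels are tied."""
    if not labels:
        raise ValueError("at least one label is required")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority, e.g. all three models disagree
    return ranked[0][0]


def ensemble_predict(nb_preds, svm_preds, ann_preds):
    """Combine per-variant predictions from the three models by majority vote."""
    if not (len(nb_preds) == len(svm_preds) == len(ann_preds)):
        raise ValueError("all models must predict the same number of variants")
    return [majority_vote(votes) for votes in zip(nb_preds, svm_preds, ann_preds)]


if __name__ == "__main__":
    nb = ["functional", "neutral", "neutral"]
    svm = ["functional", "functional", "neutral"]
    ann = ["neutral", "functional", "neutral"]
    print(ensemble_predict(nb, svm, ann))
```

With three models and two classes a tie cannot occur; the `None` result only matters if more classes are used, in which case a tie-breaking rule would need to be chosen.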
