Date of Award
Master of Science (MS)
Mathematics, Statistics and Computer Science
Stamborski, Valerie Trapp
Cancer is a genetic disease caused by the accumulation of DNA variants such as single nucleotide changes or insertions/deletions in DNA. DNA variants can cause silencing of tumor suppressor genes or increase the activity of oncogenes. In order to come up with successful therapies for cancer patients, these DNA variants need to be identified accurately. DNA variants can be identified by comparing DNA sequence of tumor tissue to a non-tumor tissue by using Next Generation Sequencing (NGS) technology. But the problem of detecting variants in cancer is hard because many of these variant occurs only in a small subpopulation of the tumor tissue. It becomes a challenge to distinguish these low frequency variants from sequencing errors, which are common in today's NGS methods. Several algorithms have been made and implemented as a tool to identify such variants in cancer. However, it has been previously shown that there is low concordance in the results produced by these tools. Moreover, the number of false positives tend to significantly increase when these tools are faced with low frequency variants. This study presents SNPredict, a single nucleotide polymorphism (SNP) detection pipeline that aims to utilize the results of multiple variant callers to produce a consensus output with higher accuracy than any of the individual tool with the help of machine learning techniques. By extracting features from the consensus output that describe traits associated with an individual variant call, it creates binary classifiers that predict a SNP’s true state and therefore help in distinguishing a sequencing error from a true variant.