Date of Award

Summer 2009

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)

Department

Mathematics, Statistics and Computer Science

First Advisor

Struble, Craig A.

Second Advisor

Tomita-Mitchell, Aoy

Third Advisor

Merrill, Stephen J.

Abstract

Hypoplastic Left Heart Syndrome (HLHS) is one of the most complex forms of congenital heart disease (CHD), and CHD is a leading cause of infant mortality and childhood morbidity due to birth defects. A widely accepted view is that CHD is a complex disease with both environmental and inherited genetic risk factors; the most recent study suggests that HLHS is highly heritable, with a heritability estimate of 32% for HLHS alone. That study implicates HLHS as a complex trait that may be due to a sporadic, unknown factor hidden in structural variations, known as copy number variants (CNVs), and in sequence variants directly linked to cardiac development genes, or to the more conspicuous genetic buffering pathways, such as the regulation of folate and environmental risk factors. We aimed to provide a reliable and efficient long-term solution for the identification, organization and storage of rare, de novo CNVs, together with a statistical modeling tool for investigating CNV disease association. Using a relational database that houses clinical diagnoses cross-mapped to specimens and CNV regions, we analyzed three data sets with an elastic net logistic regression model: Scenario 1, 121 HLHS patient specimens and 1,623 controls; Scenario 2, 121 HLHS patient specimens and 462 controls (hypertension controls excluded); and Scenario 3, 121 HLHS patient specimens and 121 controls. In an HTML report, we sorted CNV regions in decreasing order of relevance according to the magnitude of the weight assigned by the model. The top ten results of the balanced data set report (Scenario 3) that were seen only in HLHS patients and in no controls produced the following candidate genes for further investigation: FSCB, ATP6G1, PRIM2* and GUSBL2* (BC064931*; C60rt216*), PLGLB1, PLGLB2, HS3ST5*, AK126330; NIMA, SEMA3C*, OLFM3*, HRPT2*, AK026969* and CRIM1*.
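The ranking step described above can be illustrated with a minimal sketch. It assumes that the fitted model yields one coefficient per CNV region; the function name `rank_regions`, the region labels and the coefficient values are hypothetical placeholders and are not taken from the thesis.

```python
def rank_regions(weights):
    """Return (region, weight) pairs sorted by decreasing absolute weight."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)


# Placeholder coefficients for illustration only (not results from the thesis).
example_weights = {
    "region_A": 0.12,
    "region_B": -0.85,
    "region_C": 0.0,
    "region_D": 0.40,
}

ranked = rank_regions(example_weights)
for region, weight in ranked:
    # Regions with a zero weight were dropped by the sparsity penalty
    # and therefore sort last.
    print(f"{region}\t{weight:+.2f}")
```

Sorting on the absolute value keeps strongly negative and strongly positive weights together at the top of the report, which matches the "weight magnitude" ordering mentioned in the abstract.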
For all data (Scenario 1), an HTML rare regions report displayed 175 regions that were longer than 30,000 base pairs and in which fewer than 5 subjects had a variation; 65 of these regions were longer than 100,000 base pairs. Both reports display regions as hyperlinks that go directly to the UCSC Genome Browser page for that region. This methodology has broader applicability to other complex diseases and will lead to improved understanding of the genetic architecture underlying structural defects and disease, and to translational benefits.

*No direct gene hits found in the region specified; zooming out, this is the closest neighboring gene.
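The rare regions filter and the browser hyperlinks can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the `Region` record, the function names and the example coordinates are hypothetical, and the genome assembly (`hg18`) is an assumption because the abstract does not name one. Only the two thresholds (length above 30,000 base pairs, fewer than 5 subjects) come from the abstract.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Region:
    """Hypothetical CNV region record: coordinates plus number of affected subjects."""
    chrom: str
    start: int
    end: int
    n_subjects: int

    @property
    def length(self) -> int:
        return self.end - self.start


def rare_regions(regions, min_length=30_000, max_subjects=5):
    """Keep regions longer than min_length that occur in fewer than max_subjects subjects."""
    return [r for r in regions if r.length > min_length and r.n_subjects < max_subjects]


def ucsc_link(region, db="hg18"):
    """Build an HTML anchor to the UCSC Genome Browser for a region (assembly is assumed)."""
    position = f"{region.chrom}:{region.start}-{region.end}"
    url = f"https://genome.ucsc.edu/cgi-bin/hgTracks?db={db}&position={position}"
    return f'<a href="{escape(url)}">{escape(position)}</a>'


# Made-up coordinates for illustration only.
regions = [
    Region("chr1", 1_000, 50_000, 2),   # long and rare: kept
    Region("chr2", 0, 20_000, 1),       # too short: dropped
    Region("chr3", 0, 200_000, 7),      # seen in too many subjects: dropped
]
kept = rare_regions(regions)
for region in kept:
    print(ucsc_link(region))
```

Escaping the URL before placing it in the `href` attribute keeps the generated report valid HTML, since the query string contains an ampersand.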