Method for extraction of information from very small databases

Nancy Laning Sobczak, Marquette University

Abstract

Small databases present unique challenges for the extraction of meaningful information. This is especially true of genetic biological databases, where the acquisition of data from animal models is both costly and limited. We therefore face a choice: either find a method for extracting information from such a small database, or ignore information that was acquired at considerable cost and effort. This dissertation presents an archetype sentinel method for extracting meaningful information in this important area of research. A clustering method was developed to identify 'sentinel' subjects in a very small database. The method combines two computational techniques to identify subjects that are characteristic of one of two groups. We used a biological database containing parameters (phenotypes) that characterize normal Brown Norway (BN) rats and hypertensive salt-sensitive consomic 20 (SS20BN) rats. Fuzzy Cluster Means (FCM) is used to distinguish a very small group of archetype sentinel subjects from the general population. Archetype sentinels are those subjects that are most often classified correctly within a limited portion of the entire collection of biomedical phenotypes. The archetype sentinels are then used to train a Neural Network (NN) to classify all subjects. A total of 79 rats were analyzed with the FCM method, yielding 6 normal (BN) and 5 hypertensive (SS20BN) archetype subjects characterized by a total of 39 phenotypes (18 renal and 21 cardiac). These 11 archetype sentinels were then used as an NN training set to classify the non-archetype sentinel subjects as either normal or hypertensive. All of the archetype sentinels plus 10 of 11 BN and 10 of 10 SS20BN non-archetype rats were properly classified. Overall, roughly 97% of these rats (31 of 32) were classified correctly.
Results demonstrate that the FCM method can be used to isolate archetype sentinels, which can then be used to train a hard-limit (hardlim) perceptron NN to classify unrelated rats with the same genetic background. This approach can be generalized and used to classify small databases in other applications.
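
The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes standard fuzzy c-means (the technique the abstract calls Fuzzy Cluster Means) with fuzzifier m = 2, a deterministic start for the cluster centers, column-centered features, and the classic perceptron learning rule with a hard-limit output. The function names, the synthetic data, and the phenotype subsets are all hypothetical and chosen only for illustration.

```python
import math

def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Plain fuzzy c-means for c >= 2; returns (centers, memberships)."""
    n, d = len(points), len(points[0])
    # Assumption: deterministic start from evenly spaced rows
    # (random starts are more usual in practice).
    centers = [list(points[k * (n - 1) // (c - 1)]) for k in range(c)]
    u = []
    for _ in range(iters):
        # Membership update: proportional to distance^(-2/(m-1)),
        # normalised so each subject's memberships sum to 1.
        u = []
        for p in points:
            inv = [max(math.dist(p, ctr), 1e-12) ** (-2.0 / (m - 1.0))
                   for ctr in centers]
            total = sum(inv)
            u.append([v / total for v in inv])
        # Center update: mean of the points weighted by membership^m.
        for k in range(c):
            w = [row[k] ** m for row in u]
            centers[k] = [sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                          for j in range(d)]
    return centers, u

def pick_archetypes(points, labels, subsets, per_group=2):
    """Keep the subjects most often clustered correctly across phenotype subsets."""
    hits = [0] * len(points)
    for cols in subsets:
        sub = [[p[j] for j in cols] for p in points]
        _, u = fuzzy_c_means(sub)
        hard = [0 if row[0] >= row[1] else 1 for row in u]
        # Cluster indices are arbitrary: flip them if that agrees better with labels.
        agree = sum(h == y for h, y in zip(hard, labels))
        if agree < len(labels) - agree:
            hard = [1 - h for h in hard]
        for i, (h, y) in enumerate(zip(hard, labels)):
            hits[i] += int(h == y)
    chosen = []
    for g in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == g]
        idx.sort(key=lambda i: -hits[i])  # stable sort: ties keep input order
        chosen += idx[:per_group]
    return sorted(chosen)

def hardlim_predict(w, b, x):
    """Hard-limit output: 1 if the net input is >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def train_perceptron(xs, ys, epochs=100):
    """Classic perceptron rule; stops once an epoch makes no errors."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(xs, ys):
            err = y - hardlim_predict(w, b, x)
            if err:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                errors += 1
        if errors == 0:
            break
    return w, b

def center_columns(points):
    """Subtract each column's mean (a preprocessing assumption of this sketch)."""
    means = [sum(col) / len(col) for col in zip(*points)]
    return [[v - mu for v, mu in zip(p, means)] for p in points]

# Hypothetical data: two well-separated groups of 6 subjects, 4 phenotypes each.
base = [1.0, 3.0]
raw = [[base[g] + 0.1 * ((i * (j + 1)) % 5 - 2) for j in range(4)]
       for g in (0, 1) for i in range(6)]
labels = [0] * 6 + [1] * 6
data = center_columns(raw)

subsets = [(0, 1), (1, 2), (2, 3), (0, 3)]  # hypothetical phenotype subsets
archetypes = pick_archetypes(data, labels, subsets, per_group=2)
w, b = train_perceptron([data[i] for i in archetypes],
                        [labels[i] for i in archetypes])
predicted = [hardlim_predict(w, b, x) for x in data]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(archetypes, accuracy)
```

Because the synthetic groups are cleanly separated, every subject ties on the hit count, so the selection step here only demonstrates the mechanics. On real phenotype data the counts would differ between subjects, and the number of archetypes per group and the choice of subsets would need to be set from the data.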

This paper has been withdrawn.