Date of Award

Summer 2005

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)

Department

Mathematics, Statistics and Computer Science

First Advisor

Struble, Craig A.

Second Advisor

Twigger, Simon N.

Third Advisor

Ahamed, Sheikh I.

Abstract

Curators maintain and update the information in genome databases, ensuring that it is current and authentic. To do so, they draw on research articles, an important source of the scientific knowledge stored in these databases, and must pick out from the literature the papers relevant to the database they maintain. The rapidly growing literature makes it difficult to find crucial and relevant information, so curators fall behind the latest publications. Identifying papers relevant to a particular subject is an example of text categorization. This research focuses on creating a web-based software tool that uses a support vector machine (SVM) as a classifier. The SVM classifies papers as relevant or irrelevant by categorizing the text of their abstracts. Software tools that implement text categorization algorithms can help curators select highly relevant papers from the large volume of literature, making the curation of biomedical literature more effective. When classifying papers relevant to the needs of the Rat Genome Database (RGD), the tool achieves an average accuracy of 94.45%, with precision and recall of 96.34% and 94.74%, respectively.
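
As a rough illustration of the approach the abstract describes, the sketch below trains a small linear SVM on bag-of-words features and computes accuracy, precision, and recall. It is not the thesis implementation: the abstract does not state the features, kernel, or training procedure, so the subgradient-descent trainer, the toy abstracts, and every name here (`featurize`, `train_linear_svm`, `predict`, `evaluate`, `TOY_ABSTRACTS`, `TOY_LABELS`) are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy data: +1 = relevant to the database, -1 = irrelevant.
TOY_ABSTRACTS = [
    "rat gene mapping qtl",
    "galaxy star formation",
    "rat genome strain gene",
    "star cluster telescope",
]
TOY_LABELS = [1, -1, 1, -1]


def featurize(text):
    """Bag-of-words term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def score(weights, bias, features):
    """Linear decision value w.x + b; unseen terms contribute zero."""
    return sum(weights.get(t, 0.0) * c for t, c in features.items()) + bias


def train_linear_svm(docs, labels, epochs=20, lr=0.1, lam=0.01):
    """Soft-margin linear SVM trained by subgradient descent on hinge loss.

    Assumed hyperparameters: fixed learning rate `lr`, L2 strength `lam`.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = featurize(text)
            margin = y * score(weights, bias, x)
            # L2 regularization step: shrink every weight slightly.
            for term in weights:
                weights[term] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin push the weights.
            if margin < 1.0:
                for term, count in x.items():
                    weights[term] = weights.get(term, 0.0) + lr * y * count
                bias += lr * y
    return weights, bias


def predict(weights, bias, text):
    """Return +1 (relevant) or -1 (irrelevant) for one abstract."""
    return 1 if score(weights, bias, featurize(text)) >= 0 else -1


def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall, treating +1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Calling `train_linear_svm(TOY_ABSTRACTS, TOY_LABELS)` returns a weight dictionary and a bias that `predict` applies to new abstract text. Precision here is the fraction of papers flagged relevant that truly are, and recall is the fraction of truly relevant papers that get flagged, which are the two quantities reported alongside accuracy above.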
