Date of Award

Summer 2006

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)

Department

Mathematics, Statistics and Computer Science

First Advisor

Struble, Craig A.

Second Advisor

Sugg, Sonia L.

Third Advisor

Johnson, Michael T.


The extent of biomedical knowledge available in the form of published literature is astounding. Public databases and services such as MEDLINE and PubMed, compiled by the National Library of Medicine (NLM), are a source of life sciences and biomedical bibliographic information and provide access to over 16 million citations. This knowledge base of biomedical literature grows by the day, with approximately 50,000 biomedical articles published and added to PubMed every month. Manually searching such large databases for information pertinent to one's interests is a slow and tedious process; therefore, the need for effective and efficient text mining tools for organizing, curating, and extracting useful information is increasing. In this research project, we explore the problem of identifying experimental technique names that appear in a paper using tagging and document classification. The motivation behind this project is to annotate gene and disease relationships with the experimental techniques that were used to establish those relationships; this information can be helpful in analyzing and validating such relationships. For tagging experimental techniques, we followed an ontology-based approach using the Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS). MeSH and UMLS are vocabulary databases that provide information about biomedical and health-related concepts and the relationships among them; they are used for indexing, cataloging, and searching biomedical documents. For the document classification approach, a naive Bayes classifier was trained on a set of documents that were manually assigned to one or more pre-identified classes. The tagging approaches were limited by the dictionaries used and had low inter-annotator agreement; the use of classification for identifying experimental techniques proved more successful.
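As a rough illustration of the classification approach described above (not the thesis implementation; the class names and tokenization here are hypothetical), a multinomial naive Bayes text classifier with a pruned vocabulary can be sketched in pure Python:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab_size=1000):
    """Train a multinomial naive Bayes model on whitespace-tokenized docs."""
    # Prune the vocabulary to the most frequent tokens, mirroring the
    # vocabulary-size experiments described in the abstract.
    freq = Counter(tok for doc in docs for tok in doc.split())
    vocab = {w for w, _ in freq.most_common(vocab_size)}

    by_class = defaultdict(list)
    for doc, lab in zip(docs, labels):
        by_class[lab].append(doc)

    priors, cond = {}, {}
    for lab, class_docs in by_class.items():
        priors[lab] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for d in class_docs for t in d.split() if t in vocab)
        total = sum(counts.values())
        # Laplace smoothing so unseen vocabulary terms get nonzero probability.
        cond[lab] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                     for w in vocab}
    return vocab, priors, cond

def classify(doc, vocab, priors, cond):
    """Return the class with the highest log posterior for the document."""
    scores = {lab: priors[lab] +
                   sum(cond[lab][t] for t in doc.split() if t in vocab)
              for lab in priors}
    return max(scores, key=scores.get)
```

A toy usage with two made-up technique classes: training on a handful of labeled snippets and then classifying an unseen snippet such as `"pcr gel electrophoresis"`, where tokens outside the pruned vocabulary are simply ignored.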
We compared the performance of classification models based on the full text of the article, on the abstract and title only, and on five section-based models: Introduction, Materials and Methods, Results, Discussion, and Figure and Table Legends. The abstract-and-title, Results, and Figure and Table Legends models were successful and often competitive with full-text classification. The Materials and Methods model performed slightly better than full text, with a micro-averaged F1 score of approximately 67 points at vocabulary sizes of 3000 and 1000, suggesting that the Materials and Methods section alone can be sufficient for classification. We also experimented with various vocabulary sizes, and our results indicate that pruning the vocabulary has a positive effect on classification performance.
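The micro-averaged F1 metric used above pools counts across all classes before computing precision and recall. A minimal sketch, assuming multi-label gold and predicted assignments are represented as sets of class labels (the function name is illustrative, not from the thesis):

```python
def micro_f1(gold, pred, classes):
    """Micro-averaged F1 for multi-label class assignments.

    True/false positives and false negatives are pooled across all
    classes first, so frequent classes weigh more than rare ones.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for c in classes:
            if c in p and c in g:
                tp += 1      # correctly assigned class
            elif c in p:
                fp += 1      # assigned but not in the gold standard
            elif c in g:
                fn += 1      # gold-standard class that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with gold labels `[{"A"}, {"A", "B"}, {"B"}]` and predictions `[{"A"}, {"A"}, {"A", "B"}]`, the pooled counts are 3 true positives, 1 false positive, and 1 false negative, giving precision = recall = 0.75 and a micro-F1 of 0.75.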



Restricted Access Item