Date of Award

Spring 2004

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)


Mathematics, Statistics and Computer Science

First Advisor

Struble, Craig A.

Second Advisor

Merrill, Stephen

Third Advisor

Feng, Xin


The number of online documents has grown tremendously in recent years, which poses challenges for information retrieval from this vast collection. Text mining, an application of machine learning, addresses these challenges by providing techniques for extracting information from large text collections. One of the major application areas of text mining is biomedicine. The rapid growth of biomedical research gives rise to a large body of literature published every year. It is difficult to keep pace with current and related research in an area of interest. It is also difficult and time-consuming to read all the literature retrieved by a keyword search on a topic of interest. An efficient approach to this problem is document clustering, which generates meaningful groups of concepts that provide a better description of the data in a document collection. This study investigated document clustering of biomedical literature to identify the concepts represented in large document collections. Biomedical literature is indexed with a controlled vocabulary, MeSH (Medical Subject Headings), whose terms represent the major concepts discussed in a document. We compared a MeSH-based representation of the documents with a full-text representation for document clustering.
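
To make the two document representations concrete, the following is a minimal sketch, not the method used in the thesis. The abstract does not specify the term weighting, similarity measure, or clustering algorithm, so plain term counts, binary MeSH indicators, and cosine similarity are assumptions made here for illustration. The sample records, their MeSH headings, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy records: texts and MeSH headings are invented for illustration.
docs = [
    {"text": "gene expression in breast cancer cells",
     "mesh": ["Breast Neoplasms", "Gene Expression"]},
    {"text": "expression profiling of tumor genes in breast tissue",
     "mesh": ["Breast Neoplasms", "Gene Expression Profiling"]},
    {"text": "insulin resistance in type 2 diabetes patients",
     "mesh": ["Diabetes Mellitus, Type 2", "Insulin Resistance"]},
]

def full_text_vector(doc):
    """Term-frequency vector over lowercase whitespace tokens (assumed weighting)."""
    return Counter(doc["text"].lower().split())

def mesh_vector(doc):
    """Binary vector with one dimension per assigned MeSH heading (assumed weighting)."""
    return Counter(set(doc["mesh"]))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities; a clustering algorithm could consume this."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

text_sim = similarity_matrix([full_text_vector(d) for d in docs])
mesh_sim = similarity_matrix([mesh_vector(d) for d in docs])

for name, sim in (("full text", text_sim), ("MeSH", mesh_sim)):
    print(name)
    for row in sim:
        print("  " + " ".join(f"{value:.2f}" for value in row))
```

In this toy data the first and third records share no MeSH heading, so their MeSH similarity is zero, while their full-text similarity is small but nonzero because both contain the word "in". This is one way the two representations can disagree before any clustering algorithm is applied.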