Date of Award

Fall 2006

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)

Department


Mathematics, Statistics and Computer Science

First Advisor

Struble, Craig A.

Second Advisor

Sem, Daniel S.

Third Advisor

Madiraju, Praveen

Abstract


With the development of internet-based technologies, online publication and retrieval of scientific papers has become possible. Online literature databases such as PubMed and Medline now store scientific articles and make them accessible to the public. Biologists often want to gather papers on a particular topic and extract specific pieces of information for use in further experiments. PubMed is the main source of biomedical literature. The overwhelming volume of available literature, together with the growing rate at which articles are added to PubMed, makes it difficult and time-consuming for a biologist to scan all the articles manually. Hence, there is a need for a tool that can automatically read and sort the journal articles in PubMed and mine relevant knowledge. The goal of this research is to create a user-friendly system that extracts information on Protein Kinase C (PKC) inhibitors using machine learning and text-mining techniques. PKC plays an important regulatory role in various cellular signaling processes. Deregulation of PKC is responsible for the pathogenesis of many human diseases, usually through over-expression of the kinase. Inhibition of these kinases is considered a strategy for treating PKC-associated diseases, and researchers are therefore interested in information on PKC inhibitors. The goal of this research project is achieved in two main steps. The first step is to select relevant papers from PubMed, and the second is to extract PKC inhibitor information from the PKC-related papers. Relevant papers are identified using a modified SVM-based tool from previous work, with an accuracy of 73.5%. A CRF (Conditional Random Field) model built from 11,439 sentences is used to extract inhibitor names. This model achieved an F1 score of 86%.
The CRF model did not perform well at extracting concentration values, so a pattern-matching approach based on regular expressions is used to extract inhibitor concentration values, with an F1 score of 72.5%. We also automate both processes (classification of papers and extraction of inhibitor information) and create an interactive web-based tool (PKC-Miner) for querying and displaying specific information related to PKC.
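The regular-expression step mentioned above can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual patterns, so everything here is an assumption: the pattern, the name `CONCENTRATION_RE`, the function `extract_concentrations`, and the example sentence are all hypothetical. The sketch only covers plain molar concentrations written as a number followed by a unit such as nM or µM.

```python
import re

# Hypothetical pattern (not taken from the thesis): a decimal number
# followed by a molar unit. The lookbehind keeps digits that are part of
# a token such as "IC50" from being read as a value.
CONCENTRATION_RE = re.compile(
    r"""
    (?<![\w.])                    # not preceded by a letter, digit, or dot
    (?P<value>\d+(?:\.\d+)?)      # number such as 10 or 2.7
    \s*                           # optional space between number and unit
    (?P<unit>[pnuµμm]?M)\b        # pM, nM, uM, µM, mM, or M
    """,
    re.VERBOSE,
)


def extract_concentrations(sentence):
    """Return (value, unit) pairs for every concentration found in the sentence."""
    return [
        (float(match.group("value")), match.group("unit"))
        for match in CONCENTRATION_RE.finditer(sentence)
    ]


# Illustrative sentence, invented for this example.
example = (
    "Compound A inhibited PKC with an IC50 of 2.7 nM, "
    "whereas 10 µM of compound B was required."
)
print(extract_concentrations(example))
```

A pattern like this is easy to audit and extend (for example to ranges or to units such as µg/ml), which is one plausible reason a rule-based approach can work for numeric values with a small set of unit spellings.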