Date of Award

Summer 2008

Degree Type

Thesis - Restricted

Degree Name

Master of Science (MS)

Department

Mathematics, Statistics and Computer Science

First Advisor

Madiraju, Praveen

Second Advisor

Merrill, Stephen J.

Third Advisor

Struble, Craig A.

Abstract

Background: The rapid growth of public literature databases such as MEDLINE has created a need to efficiently store, retrieve and update the millions of scholarly articles they contain. Moreover, the sheer size of MEDLINE makes querying it and extracting the needed information an arduous process. There is therefore a need for a rapid, streamlined process to update, store and query MEDLINE efficiently. We believe that using alternative database systems such as native XML databases (NXDs) will significantly speed up the update process.

Results: We used existing and self-developed software packages to parse and load the 2006 release of MEDLINE into two different kinds of database systems: an NXD (Berkeley DB XML) and a relational database system (PostgreSQL). The two systems were compared using data collected on parsing and loading times, disk-space utilization and query performance. The loading times for the Berkeley DB XML and PostgreSQL implementations were 48 hours and 92 hours, respectively. The Berkeley DB XML database occupied 150.3 GB of disk space, while the PostgreSQL database used 16.8 GB.

Conclusions: The NXD offered significantly faster data parsing and loading. It was also easier to update and maintain than the relational database system. However, the relational database system we tested offered better performance when querying large datasets and used significantly less disk space. Beyond the scope of this project, we expect these findings to provide an area of focus for future development of native XML database systems.
