Date of Award
Spring 2015
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering
First Advisor
Povinelli, Richard J.
Second Advisor
Corliss, George F.
Third Advisor
Brown, Ronald H.
Abstract
This dissertation addresses the problem of data cleaning in the energy domain, especially for natural gas and electric time series. The detection and imputation of anomalies improves the performance of forecasting models necessary to lower purchasing and storage costs for utilities and plan for peak energy loads or distribution shortages. There are various types of anomalies, each induced by diverse causes and sources depending on the field of study. The definition of false positives also depends on the context. The analysis is focused on energy data because of the availability of data and information to make a theoretical and practical contribution to the field. A probabilistic approach based on hypothesis testing is developed to decide if a data point is anomalous based on the level of significance. Furthermore, the probabilistic approach is combined with statistical regression models to handle time series data. Domain knowledge of energy data and the survey of causes and sources of anomalies in energy are incorporated into the data cleaning algorithm to improve the accuracy of the results. The data cleaning method is evaluated on simulated data sets in which anomalies were artificially inserted and on natural gas and electric data sets. In the simulation study, the performance of the method is evaluated for both detection and imputation on all identified causes of anomalies in energy data. The testing on utilities' data evaluates the percentage of improvement brought to forecasting accuracy by data cleaning. A cross-validation study of the results is also performed to demonstrate the performance of the data cleaning algorithm on smaller data sets and to calculate an interval of confidence for the results. The data cleaning algorithm is able to successfully identify energy time series anomalies. The replacement of those anomalies provides improvement to forecasting models accuracy. The process is automatic, which is important because many data cleaning processes require human input and become impractical for very large data sets. The techniques are also applicable to other fields such as econometrics and finance, but the exogenous factors of the time series data need to be well defined.