abstract
| - Traditional supervised machine learning techniques need to be adapted when applied to longitudinal datasets, due to characteristics specific to such data, notably large amounts of missing values and the dependency between repeated measurements of the same variables. These adaptations range from data preprocessing techniques that maintain and use information from the underlying temporal data structure of longitudinal datasets, to algorithm adaptations that consider the temporal aspect of the data when making predictions.
In this thesis we focus on the classification task of supervised learning, in the context of longitudinal biomedical and health data from ageing studies. More specifically, we address the problem of predicting the diagnosis of age-related diseases, given several years of observations about each instance (individual).
In order to evaluate our proposed approaches for longitudinal supervised learning (described below), we created 30 longitudinal classification datasets. These datasets comprise data from the English and Irish longitudinal studies of ageing, which collect biomedical and self-reported health information on thousands of participants, across multiple waves conducted over the years.
Regarding supervised learning algorithms, we focus on decision tree-based algorithms, namely Random Forests (which learn an ensemble of decision trees) and a decision tree algorithm (which learns a single decision tree). These algorithms were chosen because they represent a good trade-off between predictive accuracy and interpretability, which is particularly relevant for our health application. Random Forests are known to achieve high predictive accuracy in general, and are partially interpretable (via feature importance measures), whilst decision trees are directly interpretable, although usually less accurate than Random Forests.
The main contributions of this thesis are three new approaches for coping with longitudinal data in supervised learning (particularly classification). The first two main contributions involve data preparation, namely missing value replacement and the construction of features representing temporal information in the data. These contributions are independent of the choice of classification algorithm to be applied to the longitudinal data, so they are widely applicable to longitudinal studies. The third main contribution involves algorithm adaptation, adapting decision tree-based algorithms to consider the temporal information in the data.
More precisely, the first main contribution of this thesis is the proposal of a data-driven missing value replacement approach to estimate the missing values in longitudinal datasets. The proposed approach performs a feature-wise ranking of an input set of missing value replacement methods, using known data as ground truth to estimate the error rate of each method. It then uses that ranking to choose the best missing value replacement method for each feature. Experiments have shown that this approach improved predictive accuracy in general, compared with several baseline methods for handling missing values.
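The feature-wise ranking idea above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function name, the masking fraction, the mean-absolute-error criterion, and the candidate replacement methods are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def rank_replacement_methods(df, methods, mask_frac=0.2, seed=0):
    """For each feature, hide a fraction of its known values, apply each
    candidate replacement method, and pick the method with the lowest
    estimated error on the hidden (ground-truth) entries."""
    best = {}
    for col in df.columns:
        known = df[col].dropna()
        if known.empty:
            continue
        # Artificially mask a random subset of known values as ground truth
        hidden = known.sample(frac=mask_frac, random_state=seed)
        errors = {}
        for name, method in methods.items():
            masked = df[col].copy()
            masked.loc[hidden.index] = np.nan
            estimates = method(masked)
            # Mean absolute error on the artificially hidden entries
            errors[name] = np.mean(np.abs(estimates.loc[hidden.index] - hidden))
        best[col] = min(errors, key=errors.get)
    return best

# Illustrative candidate methods: each fills NaNs in a Series and returns it
methods = {
    "mean": lambda s: s.fillna(s.mean()),
    "median": lambda s: s.fillna(s.median()),
    "ffill": lambda s: s.ffill().fillna(s.mean()),
}
```

The key design point is that no external ground truth is needed: the known values themselves, temporarily masked, serve to estimate each method's error per feature.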
The second main contribution consists of several types of constructed temporal features, which are calculated (in a data preprocessing phase) from the repeated measurements of the original longitudinal features. These constructed features represent different types of temporal patterns that can occur in longitudinal datasets. The constructed features are then added to the original dataset,
and used together with the original features when running any chosen classification algorithm. Experiments have shown that the constructed features were more beneficial for datasets with more temporal data (waves) available, and that adding them overall increased the predictive accuracy of the Random Forest classifiers.
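A sketch of this kind of feature construction is shown below, assuming a naming convention where each wave's measurement of a feature is stored as a column like `bmi_w1`, `bmi_w2`, and so on. The specific constructed features (overall change, mean, variability, linear trend) are illustrative examples of temporal patterns, not necessarily the exact set proposed in the thesis.

```python
import numpy as np
import pandas as pd

def add_temporal_features(df, base, waves):
    """Append constructed features summarising the temporal pattern of one
    longitudinal feature, given columns named f'{base}_w{t}' per wave t."""
    cols = [f"{base}_w{t}" for t in waves]
    values = df[cols]
    out = df.copy()
    out[f"{base}_diff"] = values[cols[-1]] - values[cols[0]]  # overall change
    out[f"{base}_mean"] = values.mean(axis=1)                 # average level
    out[f"{base}_std"] = values.std(axis=1)                   # variability
    # Slope of a least-squares line fitted across the waves (linear trend)
    t = np.array(waves, dtype=float)
    centred = values.sub(values.mean(axis=1), axis=0)
    out[f"{base}_slope"] = centred.mul(t - t.mean()).sum(axis=1) / ((t - t.mean()) ** 2).sum()
    return out
```

Because the constructed columns are simply appended to the dataset, any off-the-shelf classifier can consume them alongside the original per-wave features, which is why this contribution is algorithm-independent.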
The third main contribution of this work is an algorithm adaptation approach for decision tree-based algorithms (more precisely, Random Forests and decision tree algorithms) applied to longitudinal data inputs. We adapted the node split function of such algorithms to consider two criteria, using a lexicographic optimisation approach. This approach first tries to select the best split feature at each tree node based on the features' information gain ratio, as the primary criterion. If, however, two or more features have approximately the same information gain ratio, the algorithm applies a tie-breaking (secondary) criterion: it prefers the more recent feature, since recent measurements are assumed to be more relevant for classification than older ones.
Experiments have shown that this lexicographic split approach led to increased predictive accuracy in general for the Random Forest classifier.
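The lexicographic split selection can be sketched as a small helper used inside the node split function. The absolute tolerance `epsilon`, the function name, and the `wave_of` mapping (feature name to wave index) are illustrative assumptions; the thesis's actual definition of "about the same gain ratio" may differ.

```python
def lexicographic_best_feature(gain_ratios, wave_of, epsilon=0.01):
    """Pick the split feature by lexicographic optimisation: maximise the
    information gain ratio first; among features whose gain ratio is within
    `epsilon` of the best, prefer the one from the most recent wave."""
    best_gr = max(gain_ratios.values())
    tied = [f for f, gr in gain_ratios.items() if best_gr - gr <= epsilon]
    return max(tied, key=lambda f: wave_of[f])
```

For example, if `bmi_w1` has gain ratio 0.30 and `bmi_w3` has 0.295, the two are tied under `epsilon=0.01` and the more recent `bmi_w3` is selected; with a stricter tolerance, `bmi_w1` wins on gain ratio alone.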