Kernel machine learning methods to handle missing responses with complex predictors. Application in modelling five-year glucose changes using distributional representations
Background and objectives: Missing data is a ubiquitous problem in longitudinal studies due to
the number of patients lost to follow-up. Kernel methods have enriched the machine learning
field by successfully managing non-vectorial predictors, such as graphs, strings, and probability
distributions, and have emerged as a promising tool for the analysis of complex data stemming
from modern healthcare. This paper proposes a new set of kernel methods to handle missing data
in the response variables. These methods will be applied to predict long-term changes in glycated
haemoglobin (A1c), the primary biomarker used to diagnose and monitor the progression of
diabetes mellitus, with an emphasis on exploring the predictive potential of continuous glucose
monitoring (CGM).
Methods: We propose a new framework of non-linear kernel methods for testing statistical
independence, selecting relevant predictors, and quantifying the uncertainty of the resultant
predictive models. As a novelty in the clinical analysis, we used a distributional representation of
CGM as a predictor and compared its performance with that of traditional diabetes biomarkers.
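
As a rough illustration of the ideas named in the Methods paragraph, the sketch below represents each patient's CGM record by its empirical quantile function, builds Gaussian kernels on those representations and on the response, and runs a permutation test of independence based on the biased HSIC statistic. Everything in it is an assumption made for illustration: the synthetic data, the function names, the quantile grid, the median-heuristic bandwidth, and the choice of HSIC itself. It also uses complete cases only, so it does not reproduce the paper's correction for missing responses.

```python
import numpy as np

def quantile_representation(glucose, grid):
    """Empirical quantile function of one patient's CGM readings on a probability grid."""
    return np.quantile(np.asarray(glucose, dtype=float), grid)

def gaussian_gram(features):
    """Gaussian Gram matrix on the rows of `features`, median-heuristic bandwidth."""
    sq = np.sum(features ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    bandwidth2 = np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-d2 / (2.0 * bandwidth2))

def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centring matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

def permutation_pvalue(K, L, n_perm, rng):
    """Permutation p-value obtained by shuffling the response indices."""
    observed_stat = hsic(K, L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(K.shape[0])
        if hsic(K, L[np.ix_(p, p)]) >= observed_stat:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Synthetic stand-in data (hypothetical, not from the study).
rng = np.random.default_rng(0)
n_patients = 60
mean_glucose = rng.uniform(100.0, 200.0, size=n_patients)
cgm = [rng.normal(m, 20.0, size=200) for m in mean_glucose]
y = (mean_glucose - 150.0) / 25.0 + 0.1 * rng.normal(size=n_patients)

# Roughly a quarter of the responses are lost to follow-up.
observed = rng.random(n_patients) > 0.25
y[~observed] = np.nan

# Euclidean distance between quantile vectors approximates the
# 2-Wasserstein distance between the glucose distributions (up to scaling).
grid = np.linspace(0.01, 0.99, 50)
X = np.vstack([quantile_representation(g, grid) for g in cgm])

# Complete-case analysis only: an assumption of this sketch.
K = gaussian_gram(X[observed])
L = gaussian_gram(y[observed][:, None])

stat = hsic(K, L)
p_value = permutation_pvalue(K, L, n_perm=200, rng=rng)
print(f"HSIC = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

Dropping patients with missing responses is only defensible when missingness is unrelated to the outcome; handling the general case is precisely what the proposed methods address.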
Results: The results show that, after the incorporation of CGM information, predictive ability
increases from R² = 0.61 to R² = 0.71. In addition, uncertainty analysis is useful for
characterising subpopulations where predictive performance is worse, and a more personalised
clinical follow-up is advisable according to expected patient uncertainty in glucose values.
Conclusions: The proposed methods have proven to deal effectively with missing data. They
also have the potential to improve the results of predictive tasks by including new complex
objects as explanatory variables and modelling arbitrary dependence relations. The application
of these methods to a longitudinal study of diabetes showed that the inclusion of a distributional
representation of CGM data provides greater sensitivity in predicting five-year A1c changes than
classical diabetes biomarkers and traditional CGM metrics.
Publication: Article
May 24, 2022
/research/publications/kernel-machine-learning-methods-to-handle-missing-responses-with-complex-predictors-application-in-modelling-five-year-glucose-changes-using-distributional-representations
Marcos Matabuena, Paulo Félix, Carlos Meixide, Francisco Gude - 10.1016/j.cmpb.2022.106905