An ensemble-based phenotype classifier to diagnose Crohn’s disease from 16S rRNA gene sequences

In recent years, disease classification using machine learning has attracted particular interest in bioinformatics. The task is especially challenging for dysbiosis-based diseases, i.e., diseases caused by an imbalance in the composition of the microbial community. In this work, we follow a curated pipeline for classifying phenotypes from 16S rRNA gene amplicons, focusing on Crohn’s disease. The pipeline reduces the dimensionality of the data through a feature selection step, which decreases the computational cost while maintaining an acceptably high F1-score. Based on this study, we propose an ensemble model that combines the best-performing techniques among several representative machine learning algorithms. This ensemble, which joins a multilayer perceptron, extreme gradient boosting, and support vector machines, reached F1-scores of up to 0.81 with a target number of features as low as 300. These results are similar to or better than those of other works studying the same data, which supports the validity of our method.

Keywords: phenotype classification, 16S rRNA gene, ASVs, ensemble-based classification, K-nearest neighbors, support vector machines
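
To make the idea of joining several classifiers and scoring the result concrete, the following minimal sketch averages the positive-class probabilities of three models and computes the F1-score of the combined prediction. It rests on assumptions: the abstract does not state how the ensemble combines its members, so averaging (soft voting) is used here purely for illustration; the helper names `soft_vote` and `f1_score` are hypothetical; and the probabilities and labels are toy values, not results from this study. The feature selection step is not shown.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Average each sample's positive-class probability across models, then threshold.

    Soft voting is an assumption made for this sketch; the combination rule
    of the actual ensemble is not specified in the abstract.
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    averaged = [
        sum(model[i] for model in prob_sets) / n_models for i in range(n_samples)
    ]
    return [1 if p >= threshold else 0 for p in averaged]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy probabilities standing in for the three ensemble members.
mlp_probs = [0.9, 0.4, 0.7, 0.2, 0.6]
xgb_probs = [0.8, 0.6, 0.3, 0.1, 0.7]
svm_probs = [0.7, 0.3, 0.4, 0.3, 0.4]
y_true = [1, 1, 0, 0, 1]  # toy labels: 1 = Crohn's disease, 0 = control

y_pred = soft_vote([mlp_probs, xgb_probs, svm_probs])
score = f1_score(y_true, y_pred)
print(y_pred, round(score, 2))
```

In this toy case the second sample is a positive that the averaged vote misses, so recall drops to 2/3 while precision stays at 1, which illustrates why the F1-score is a stricter summary than accuracy alone when classes matter unequally.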