PhD Defense: 'Application of machine learning to agricultural soil data'

One of the main risks to Indian agriculture is soil degradation, caused by deficient or incorrect application of fertilizers. Village-wise fertility maps for several relevant soil nutrients provide the knowledge of land fertility needed to apply fertilizers adequately, but developing them requires measuring soil chemical parameters in many cultivated fields.

In order to reduce this work and save the time of specialized technicians, this thesis applies classification and regression methods that automatically predict nutrient levels and their village-wise fertility indices. Specifically, using data from the Indian region of Marathwada, we classify the village-wise fertility indices of organic carbon, phosphorus pentoxide, manganese and iron; the levels of the nutrients phosphorus pentoxide, nitrous oxide and potassium oxide; soil acidity (pH); recommended crop; and soil type.

We use 20 classification methods including, among others, bagging and boosting ensembles, decision trees, nearest neighbors, extreme learning machines with and without Gaussian kernel, multi-layer perceptrons, probabilistic and radial basis function (RBF) neural networks, random forests, support vector machines and rule-based classifiers. Random forests achieve the best results for almost all the problems, with accuracies between 69% and 99% for all the problems except the nutrient levels (about 55%). For some classification problems, the models trained on data from Marathwada are compatible with two other neighboring regions (Paschim Maharashtra and North Maharashtra).

In order to predict the numeric values of village-wise soil fertility indices for organic carbon, phosphorus pentoxide, manganese, iron and zinc, this thesis also develops a study comparing a large collection of 77 regression methods belonging to 20 families: linear and generalized linear regression, least and partial least squares, least absolute shrinkage and selection operator (LASSO) and ridge regression, neural networks, deep learning, support vector regression, regression trees, bagging and boosting ensembles, random forests, nearest neighbors, Bayesian models, principal component analysis, generalized additive models, Gaussian processes, quantile regression and other methods: least angle regression, multivariate adaptive regression splines (MARS), projection pursuit regression (PPR) and subtractive clustering with fuzzy C-means. The comparison uses 66 regression datasets from the UCI machine learning repository, and reveals that the extreme learning machine (ELM) and support vector regression (SVR), both with Gaussian kernel, together with extremely randomized regression trees (extraTrees) and the gradient boosting machine (GBM), achieve the highest performance, exceeding 90% of the highest squared correlation on average over the whole dataset collection. Applied to the five fertility indices above, extraTrees achieves the best results, with squared correlations between 0.57 and 0.76, which correspond to a fairly accurate prediction. In fact, when the nutrient fertility levels (low, medium and high) are considered instead of the numeric values, the fertility level is predicted with an accuracy between 79% and 98% depending on the nutrient, which can be considered a reliable automatic prediction for the development of village-wise soil fertility maps.
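
The two evaluation measures mentioned above, squared correlation on the numeric index and accuracy on the low/medium/high level, can be illustrated with a minimal Python sketch. It is not the thesis code: the function names, the cut-off values `LOW_MAX` and `MEDIUM_MAX`, and the toy index values are all assumptions made for illustration, and the thesis may define the levels differently. The sketch shows how a prediction that is only moderately correlated with the measured index can still fall in the correct fertility level most of the time.

```python
def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov * cov / (var_t * var_p)


# Hypothetical cut-offs separating the three fertility levels (assumption).
LOW_MAX = 1.67
MEDIUM_MAX = 2.33


def fertility_level(index):
    """Map a numeric fertility index to 'low', 'medium' or 'high'."""
    if index < LOW_MAX:
        return "low"
    if index <= MEDIUM_MAX:
        return "medium"
    return "high"


def level_accuracy(y_true, y_pred):
    """Fraction of villages whose predicted level matches the measured level."""
    hits = sum(
        fertility_level(t) == fertility_level(p) for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Toy values for five villages, invented for illustration only.
measured = [1.2, 1.9, 2.6, 2.1, 1.5]
predicted = [1.4, 2.0, 2.4, 1.6, 1.3]

print("squared correlation:", squared_correlation(measured, predicted))
print("level accuracy:", level_accuracy(measured, predicted))
```

In this toy case only one village crosses a level boundary, so the level accuracy stays high even though the numeric predictions deviate from the measured values.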