MAGIST-ELA: Large Scale Geo-processing for Exploratory and Learning Analytics

Problems related to road traffic are a major concern for city authorities and therefore a key challenge for modern urban Intelligent Transportation Systems (ITS). They include the analysis of road traffic flow and its environmental impact, as well as the analysis of road infrastructure degradation.
Advances in sensing technologies and the involvement of citizens through crowdsensing mobile applications are producing large amounts of data at unprecedented rates. As a result, a paradigm shift has been identified from traditional technology-driven ITS to modern data-driven ITS, in which large amounts of heterogeneous sensor data are used to train learning algorithms. Big Data has gained much interest in this area, and it poses important challenges at all software layers.
Much of the ITS data is geospatial in nature and includes both vector and raster data. Traditionally, vector and raster data are stored and managed by distinct technologies. Recently, the so-called Data Lake has emerged as a new distributed data storage architecture for modern data warehouses. Current spatial extensions to scalable data processing frameworks are mainly designed with vector data in mind.
Despite the advances in large-scale data processing technologies, and even when only vector data are considered, the response times required for interactive exploratory analysis of very large datasets remain an elusive objective. For machine learning based analytics, on the other hand, parallel implementations have enabled the application of these techniques to training sets of sizes that were difficult to imagine years ago. However, the cost of such highly parallel solutions and their environmental impact are frequently too high. To tackle these problems, specific query processing techniques have been proposed. Interactive response times may be reached by using approximate query processing built on top of dataset synopses, which include samples and sketches. Regarding machine learning, the latest generation of solutions expresses the learning process directly as a batch of queries over the input relational data. These queries can then be optimized, resulting in performance gains of several orders of magnitude with respect to the traditional use of machine learning tools over materialized views of the database. The specificities of spatial data, and of raster data in particular, have not been studied in depth in any of the above query processing approaches.
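As a minimal sketch of approximate query processing over a sample synopsis, the following Python example estimates an average from a uniform random sample and reports an approximate 95% confidence half-width. The data (synthetic speed readings), the function name `approximate_mean`, and the sample size are illustrative assumptions and are not taken from the project.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is an approximate
    95% confidence half-width based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_err


# Hypothetical dataset: synthetic speed readings in km/h.
data_rng = random.Random(42)
speeds = [data_rng.gauss(50.0, 10.0) for _ in range(200_000)]

exact = statistics.fmean(speeds)  # full scan, for comparison only
estimate, half_width = approximate_mean(speeds, sample_size=2_000)

print(f"exact mean:  {exact:.2f}")
print(f"approximate: {estimate:.2f} +/- {half_width:.2f}")
```

The estimate reads only 1% of the values, which is the kind of trade-off between accuracy and response time that sample-based synopses offer. Sketches and spatially aware sampling would require different estimators.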
Based on all the above, the main objective of the MaGIST-ELA subproject is the development of efficient query processing solutions over very large heterogeneous (vector and raster) geospatial data lakes, to support both the exploratory and the learning analytic workloads that arise in urban-scale intelligent road traffic analysis. To achieve this, machine learning solutions will first be applied to heterogeneous geospatial sensor data for road traffic flow monitoring and prediction, and the results will subsequently be used in intelligent air quality monitoring and prediction. Machine learning will also be used to estimate pavement degradation from crowdsensed mobile data. Next, data storage and approximate query processing techniques will be designed to support integrated exploratory analysis of very large vector and raster datasets. Finally, efficient query processing techniques will be designed and implemented to support the workloads demanded by machine learning over raster and vector geospatial data.
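To illustrate what a machine learning workload expressed as queries can look like, the sketch below fits a one-variable least-squares line using a single aggregate query over a join, instead of first exporting the joined rows as a training table. It is only a toy illustration under stated assumptions: the tables `segment` and `reading`, their columns, and the data are hypothetical, and it does not represent the project's techniques or any optimization beyond the aggregate formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE segment (segment_id INTEGER PRIMARY KEY, lanes INTEGER);
    CREATE TABLE reading (segment_id INTEGER, flow REAL);
    """
)
# Hypothetical data: flow = 100 * lanes + 50, so the expected fit is known.
conn.executemany(
    "INSERT INTO segment VALUES (?, ?)", [(1, 1), (2, 2), (3, 3), (4, 4)]
)
conn.executemany(
    "INSERT INTO reading VALUES (?, ?)",
    [(1, 150.0), (2, 250.0), (3, 350.0), (4, 450.0), (2, 250.0)],
)

# One aggregate query computes the sufficient statistics of the regression.
n, sx, sy, sxx, sxy = conn.execute(
    """
    SELECT COUNT(*),
           SUM(s.lanes),
           SUM(r.flow),
           SUM(s.lanes * s.lanes),
           SUM(s.lanes * r.flow)
    FROM reading AS r
    JOIN segment AS s ON s.segment_id = r.segment_id
    """
).fetchone()

# Closed-form least-squares solution from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

print(f"flow ~ {slope:.1f} * lanes + {intercept:.1f}")
conn.close()
```

Because the model depends on the data only through a handful of sums, the database engine is free to optimize how those sums are computed, which is the property that query-based learning approaches exploit.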

Objectives

The main objective of the MaGIST-ELA (USC) subproject is the development of efficient query processing techniques over very large heterogeneous (vector and raster) data lakes and their application to the geospatial analytics that arise in smart urban-scale road traffic analysis. The analytics considered include: i) exploratory analysis tasks performed to browse and navigate the data lake and ii) training and evaluation of machine learning techniques. This main research objective is further subdivided into the following two specific objectives:

1. Design and implementation of solutions based on machine learning techniques for: a) monitoring and prediction of road traffic and of its impact on air quality, and b) monitoring and prediction of pavement degradation using crowdsensed mobile data.
2. Development of the data storage and query processing techniques required for the efficient implementation of some of the analytics considered in the above objective.

- José Ángel Taboada González