Efficient Access to Heterogeneous Environmental Data Repositories Through Linked Data Standards

Environmental data is of great importance in many fields of science and engineering. Consequently, substantial resources are spent every year on the construction and maintenance of complex environmental observation and modeling infrastructures, which are now also complemented with citizen data provided through crowdsensing approaches. Broadly speaking, two major types of environmental datasets may be identified based on the geospatial dimension of the data: (i) Vector Features, which correspond to entities or objects that have properties of conventional data types and possibly properties of geometric data types, which are represented with vectors of spatial coordinates; and (ii) Raster Coverages, which represent the variation in space and/or time of conventional properties. Geospatial and environmental data infrastructures enable the storage of both types of data, as well as the discovery of the datasets and access to them through standard web service interfaces. However, the use of those infrastructures is generally limited to experts with very specific skills. In contrast, general-purpose open data infrastructures are evolving towards the construction of a linked web of data, using technologies from the semantic web area, including the RDF data model and the SPARQL query language. Many application domains would benefit from the incorporation of geospatial and environmental data sources into the linked data web. In general, such geospatial enablement of the semantic web would allow ICT practitioners to bring environmental data sources closer to a wider variety of citizens' problems. To achieve this, however, the underlying technologies have to be extended to support geospatial data. Some efforts have already been made; in particular, the GeoSPARQL standard already supports the querying of Vector Features with reasonable performance.
However, the querying of Raster Coverages and the integrated querying of both types of data are not adequately supported by current semantic web query engines. To address this gap, this thesis proposes advances in various components of a new GeoSPARQL query engine, called GeoLD, that enables the efficient integrated querying of heterogeneous environmental datasets, including Vector Features and Raster Coverages, which are available through the standard web services of geospatial data infrastructures. The proposed solution enables access to Raster Coverages by providing a new Coverage to RDF Mapping Language (C2RML), which allows the programmer to incorporate specific vocabularies when defining the mapping between the coverage data schema and RDF. New SPARQL operators and a new query optimization strategy provide the algorithms required to reach query response times that are orders of magnitude faster than those of state-of-the-art technologies. Additionally, in contrast to existing approaches, raster data querying is achieved without the need to use large lists of specific raster manipulation functions, which simplifies the programming task. Future research lines are mainly related to improving the data representation, query processing and optimization artifacts of the system, and to raising the Technology Readiness Level (TRL) of the implementation, to bring it closer to real technology transfer objectives.

keywords: Geospatial linked data, Scientific linked data, Array linked data, Raster linked data, GeoSPARQL, Spatial query processing