BigDataDIRAC: deploying distributed Big Data application

The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in different grid and cloud environments. Many communities from several fields (LHCb, Belle II, Creatis, DIRAC4EGI multiple community portal, etc.) use DIRAC to run jobs in distributed environments. Google created the MapReduce programming model offering an efficient way of performing distributed computation over large data sets. Several enterprises are providing Hadoop cloud based resources to their users, and are trying to simplify the usage of Hadoop in the cloud. Based in these two robust technologies, we have created BigDataDIRAC, a solution which gives users the opportunity to access multiple Big Data resources scattered in different geographical areas, such as access to grid resources. This approach opens the possibility of offering not only grid and cloud to the users, but also Big Data resources from the same DIRAC environment. Proof of concept is shown using three computing centers in two countries, and with four Hadoop clusters. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by dataset location stored in the Hadoop File System (HDFS) of the Hadoop distributed clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. The tests produced the equivalent of 5 days continuous processing.

keywords: Big Data, DIRAC, MapReduce, Hadoop, Hive, Cloud Computing, Multi-cloud environment