On the road to a unified Big Data and HPC framework

We are living in the Big Data era. The de facto standards for parallel processing of Big Data are the Apache Hadoop and Apache Spark engines. These frameworks require applications to be implemented in Python, Java or Scala using programming paradigms such as MapReduce. HPC applications, however, have traditionally been implemented in Fortran and C/C++ in order to exploit the multithreading (OpenMP) and multiprocessing (MPI) capabilities of clusters. As a consequence, this divergence between HPC and Big Data languages and programming models hinders the creation of applications that bring together the advantages of the HPC and Big Data worlds. To deal with this issue we introduce Ignis, a new Big Data-HPC framework that allows the execution of applications that combine multiple programming languages without additional overhead. Our framework uses a multi-language RPC approach to create a native executor for each language and a modular design pattern that facilitates the inclusion of new languages. All Ignis communications are internally implemented using MPI collective operations, which allows users to execute native MPI applications on an Ignis cluster. In this way, MPI and Spark-like codes can be combined in the same application. Unlike previous works, our proposal is a step forward in the convergence of HPC and Big Data, since applications belonging to both worlds can be executed efficiently in the same framework. Note that Ignis is fully developed and executed inside Docker containers, which avoids system dependency and library incompatibility issues. Our architecture is divided into four independent main modules: Submitter, Backend, Driver and Executor. They are coded in different languages and use Apache Thrift over an SSH tunnel for inter-module communication. The Submitter is a simple script, similar to spark-submit, that calls a cluster manager (Mesos) to create a Driver instance.
It lives in the Submitter container, where users prepare the environment before launching a job. The Executor contains a collection of functions that the Backend uses to implement its logic. Each language has its own Executor implementation, and these implementations can communicate with each other to share data. The Driver is a user API through which users can access all the available functionality of the framework. This module does not perform any heavy computation and uses Thrift RPC to delegate its work to the Backend, which implements the Driver functionality. Performance tests were carried out by sorting 1TB of text data, containing about two billion lines, in ascending order. We compared the built-in sort capability of Spark with Ignis using a C++ executor and Python driver code. Ignis Terasort is between 2.06x and 1.45x faster than Spark when using from 4 to 10 nodes with 32 cores per node. Regarding memory consumption, Spark incurs a penalty of 40 extra bytes per string (that is, for each line of the input data). Moreover, it requires 128MB for each JVM. A C++ application, on the other hand, only needs 24 extra bytes per string. This means that, ignoring the JVM overhead, Spark will always consume 1.6x more memory than the Ignis C++ executor.
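
The memory comparison can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in this abstract (about two billion lines, 40 versus 24 extra bytes per string). It assumes that the 1.6x figure refers to the ratio of per-string overheads rather than to total memory including the string payload; the function and variable names are illustrative and are not part of Ignis or Spark.

```python
# Figures quoted in the abstract.
LINES = 2_000_000_000        # "about two billion lines"
SPARK_EXTRA_BYTES = 40       # extra bytes per string in Spark
CPP_EXTRA_BYTES = 24         # extra bytes per string in a C++ application


def string_overhead_bytes(lines: int, extra_per_string: int) -> int:
    """Bookkeeping overhead only; the string payload and JVM cost are excluded."""
    return lines * extra_per_string


spark_overhead = string_overhead_bytes(LINES, SPARK_EXTRA_BYTES)
cpp_overhead = string_overhead_bytes(LINES, CPP_EXTRA_BYTES)

# Assumption: the comparison is between overheads, not total footprints.
ratio = spark_overhead / cpp_overhead

print(f"Spark overhead: {spark_overhead / 1e9:.0f} GB")   # 80 GB
print(f"C++ overhead:   {cpp_overhead / 1e9:.0f} GB")     # 48 GB
print(f"Ratio:          {ratio:.2f}")                     # 1.67
```

Under this reading the ratio is 40/24, about 1.67, which agrees with the 1.6x figure if that value was truncated rather than rounded. The ratio does not depend on the number of lines, which is why the overhead gap holds for any input size.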

keywords: HPC, Big Data