Towards a Big Data Multi-language Framework using Docker Containers

Most of the relevant Big Data processing frameworks (e.g., Apache Hadoop, Apache Spark) support only JVM (Java Virtual Machine) languages by default. To support non-JVM languages, they create subprocesses and connect them to the framework through system pipes. With this technique, data cannot be managed at thread level, and the added overhead degrades performance. In this paper we introduce a new Big Data framework that provides an elegant way to create multi-language executors managed through an RPC system, allowing the user to take advantage of each programming language for different tasks. The system runs completely inside Docker containers. Moreover, our framework has a custom Docker-based resource manager, responsible for assigning the available resources of a cluster. A comparison with Apache Spark shows the benefits of our proposal in terms of performance and scalability.

keywords: Big Data processing, Framework, Multi-Language, Performance, Docker, Apache Thrift