OPERA-gSAM: Big Data Processing Framework for UMI Sequencing at High Scalability and Efficiency

The rapidly increasing demand for next-generation biotechnologies has driven the development of big data (BD)-oriented pipelines for DNA and RNA. The preprocessing stage requires sequencing and alignment tools that provide barcoding for error correction and improve sequencing accuracy. Unique Molecular Identifiers (UMIs), attached to molecules before the amplification stage, promise highly accurate bioinformatic identification of PCR duplicates. However, using alignment coordinates alone is data-intensive and challenging because of the increased demand for computational throughput, which degrades the performance of the underlying resources. This paper proposes OPERA-gSAM, a highly scalable data scheduling and resource allocation framework for the genome Sequence Alignment Map (SAM). OPERA-gSAM, an OPportunistic and Elastic Resource Allocation framework, uses a big data platform (i.e., Apache Spark) to enable next-generation massively parallel sequencing applications. We validate the scalability and efficiency of OPERA-gSAM using Genomics single-cell RNA sequencing. Our experiments demonstrate the usability and high efficiency of the proposed framework. Results show that OPERA-gSAM is up to 2.4× faster while consuming 50% fewer resources than the conventional pipeline using SAM and UMI tools.

keywords: genome Sequence Alignment Map