Open Access

HCudaBLAST: an implementation of BLAST on Hadoop and Cuda

Journal of Big Data 2017, 4:41

Received: 8 December 2016

Accepted: 7 November 2017

Published: 16 November 2017


The world of DNA sequencing has been a difficult field since its inception, and it is growing at an exponential rate. The amount of data involved in DNA searching is enormous, so ordinary tools and algorithms are not suitable for this scale of data processing. BLAST is a tool provided by the National Center for Biotechnology Information (NCBI) to compare nucleotide or protein sequences against sequence databases and calculate the statistical significance of matches. Variants of BLAST such as blastn, blastp, and blastx are used to search nucleotide, protein, and translated nucleotide-against-protein sequences, respectively. GPU-BLAST and HBLAST have already been proposed to handle the vast amount of data involved in DNA sequence searching and to speed up the search process. In this article, we propose a new model for searching DNA sequences, HCudaBLAST, which combines CUDA processing and Hadoop for efficient searching. The results recorded after implementing HCudaBLAST are shown. This solution combines the multi-core parallelism of GPGPUs with the scalability provided by the Hadoop framework.


DNA searching · BLAST · CUDA · Hadoop


Searching a DNA or protein sequence against a database is the most common operation in the analysis of any new sequence, and the reason is simple: it finds similar regions of nucleotides or proteins between two or more sequences. This similarity can be used to determine many things, including how closely related two or more species are, whether a sequence belongs to a completely new species, and where functional domains lie within the sequence of interest. However, finding similar regions between two or more sequences is hard because of the size of the existing sequence databases involved. To overcome this difficulty, various tools and algorithms have been proposed; we review some of them in the following paragraphs.

In computational biology and bioinformatics, aligning sequences to determine their similarity is an essential and widely used computational procedure for biological sequences. A wide range of computational algorithms has been applied to the sequence alignment challenge: the Smith–Waterman algorithm [1], which is slow but accurate and based on dynamic programming, and the basic local alignment search tool (BLAST) [2] and FASTA [3], which are faster but less accurate and based on heuristics. The first algorithm was given by Smith and Waterman in 1981. It is a local sequence alignment algorithm with high time complexity, but it gives optimal results. To overcome the time consumption of the Smith–Waterman algorithm, Lipman and Pearson proposed the FASTA tool in 1985, which takes a given nucleotide or amino acid sequence and searches a corresponding sequence database using local sequence alignment; its heuristic method contributes to its high execution speed. BLAST is likewise a heuristic algorithm, but it is even more time-efficient than FASTA because it searches only for the significant patterns in the sequences, with comparable sensitivity.

Due to its heuristic approach, the execution speed of BLAST is significantly higher, but the amount of data being processed is very large and increasing at an exponential rate; for example, UniMES for metagenomic data sets [4] continues to expand rapidly as next-generation sequencing (NGS) becomes more widespread. The scalability problem of BLAST has been addressed by combining Hadoop and BLAST, which is called HBLAST [5]. It is a hybrid “virtual partitioning” approach that automatically adjusts the database partition size depending on the Hadoop cluster size as well as the number of input query sequences. Yet another improvement in the speed of the BLAST algorithm has been proposed in the form of GPU-BLAST [6] and other CUDA-based implementations [7]. GPU-BLAST can perform protein alignments up to four times faster than single-threaded NCBI-BLAST; compared with six-threaded NCBI-BLAST, it is nearly two times faster.

Hadoop is an open-source framework developed by Apache; it is a platform that can store and manage large volumes of data, and it provides a relatively easy way to write and run applications for massive data processing. Hadoop has two main components. The first is the Hadoop distributed file system (HDFS) [8], a fault-tolerant system for storing data on low-cost hardware. The second is MapReduce [9], a programming model that processes the stored data by running tasks with efficient scheduling. Hadoop follows a master–slave architecture and comprises several daemon processes such as NameNode, DataNode, NodeManager, and ResourceManager. HDFS comprises one NameNode and numerous DataNodes: the NameNode is in charge of the metadata, file blocks, and namespace of HDFS-stored files, while the DataNodes are in charge of physically storing and managing the data at the block level.
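As a minimal illustration of the MapReduce programming model described above (not of HCudaBLAST itself), the plain-Java sketch below mimics the map, group-by-key, and reduce contract with no Hadoop dependency; the class and method names are ours and purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal word-count illustration of the MapReduce contract:
// map emits key-value pairs, the framework groups them by key,
// and reduce aggregates each group.
public class MapReduceSketch {
    // Map phase: emit (word, 1) for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                out.add(Map.entry(word, 1));
        return out;
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map(List.of("a b a", "b c")));
        System.out.println(counts); // counts: a=2, b=2, c=1 (map order unspecified)
    }
}
```

In real Hadoop, the map and reduce functions run as distributed tasks over HDFS blocks; the sketch only shows the data-flow contract.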

Hadoop provides scalability and efficient handling of the existing database of sequences, while the GPU provides time efficiency for the whole process. It is intuitive to merge these two prominent technologies to gain their combined strength and obtain better results in terms of time and storage. This article therefore presents HCudaBLAST, a combination of Hadoop and CUDA processing that runs the existing NCBI BLAST algorithm.

The rest of the article is organized as follows. In the "Background and related work" section, we introduce related work on this topic. In the "Proposed work" section, we give the details of our proposed work and the HCudaBLAST algorithm; in the "Experimental setup" section, we describe our experimental setup, followed by the conclusion in the "Conclusion" section.

Background and related work

BLAST with GPU computing

The graphics processing unit (GPU) is a multi-core processor that concurrently executes hundreds of threads. Originally it was used only for graphics acceleration, but modern GPUs are organized in a streaming, data-parallel model in which the same instructions are executed on multiple data streams simultaneously. The basic architecture of a GPU usually contains N multiprocessors, each containing M processors. Various frameworks are available for GPU processing, but the two most popular are the compute unified device architecture (CUDA) and the open computing language (OpenCL).

Much research has been done to reduce the time consumption of the BLAST algorithm using GPU computing. Some of the most notable GPU implementations of BLAST are described below.

Panagiotis D. Vouzis et al

The above-mentioned authors wrote the article “GPU-BLAST: using graphics processors to accelerate protein sequence alignment” [6]. They proposed an enhanced version of the popular NCBI-BLAST using the CUDA computing model. In this design, the subject sequences are stored in global memory, whereas the queries are stored in constant memory. A GPU has multiple multiprocessors, and each multiprocessor is represented by a block; the processors within each multiprocessor are represented by threads, and every multiprocessor has a shared memory. The instruction to be performed is stored in the instruction unit, the processors access the sequences from global memory, and the score is calculated with the help of the BLOSUM62 matrix, which is stored in shared memory. GPU-BLAST achieves a speedup of roughly three to four times over sequential NCBI-BLAST, but it exhibits drawbacks such as high power consumption with large amounts of data.
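As a rough illustration of the per-thread scoring step described above, the sketch below scores a query word against a subject position using a substitution matrix. A toy four-letter match/mismatch matrix stands in for BLOSUM62 to keep the example short, and all names are ours, not taken from the GPU-BLAST source.

```java
// Illustrative sketch (not the GPU kernel itself) of scoring a query
// word against a subject position with a substitution matrix — the
// per-thread work that GPU-BLAST parallelises. A toy 4-letter matrix
// replaces BLOSUM62 for brevity.
public class SubstitutionScore {
    static final String ALPHABET = "ACGT";
    // Toy substitution matrix: +2 on match, -1 on mismatch.
    static final int[][] MATRIX = new int[4][4];
    static {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                MATRIX[i][j] = (i == j) ? 2 : -1;
    }

    // Score 'word' aligned against 'subject' starting at offset 'pos'.
    static int score(String word, String subject, int pos) {
        int s = 0;
        for (int k = 0; k < word.length(); k++) {
            int a = ALPHABET.indexOf(word.charAt(k));
            int b = ALPHABET.indexOf(subject.charAt(pos + k));
            s += MATRIX[a][b];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("ACG", "TACGT", 1)); // three matches -> prints 6
    }
}
```

On the GPU, each thread would run this loop for a different subject sequence in parallel; here it is single-threaded Java purely for clarity.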

BLAST with distributed computing

The existing DNA and protein sequence databases are so large that a single computer system cannot process them efficiently, no matter how much storage capacity and computing power it has. Distributed computing provides a relatively better way to deal with such problems, and there has been much research on applying it to the BLAST algorithm. The most popular approaches are cloud computing and the Hadoop framework. These frameworks provide not only computing capability but also storage for large amounts of data, so they can easily solve the scalability problem of storing the existing database.

Andréa Matsunaga et al

The above-mentioned authors wrote the article “CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications” [10]. They implemented an integration of Hadoop, virtual workspaces, and a virtual network (ViNe) as the MapReduce, virtual machine, and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST. Query sequences are distributed across the cluster and processed in parallel. First a virtual machine (VM) is created and stored in a virtual workspace factory server. When a biologist wants to process a query, he or she requests a replica of the VM; the VM is cloned and configured with the appropriate network addresses at each site. The biologist can then log into one of the VMs, upload the input sequence, and start executing the BLAST job. CloudBLAST does not scale well to large clusters, where load balancing is often needed to reach maximum performance.

Aisling O’Driscoll et al

The above-mentioned authors wrote the article “HBLAST: parallelised sequence similarity—a Hadoop MapReducable basic local alignment search tool” [5]. They used virtual partitioning to parallelize the BLAST algorithm by partitioning both the databases and the input query sequences. HBLAST adjusts the database partition size depending on the Hadoop cluster size and the number of input query sequences; the size of a virtual partition is chosen according to the available random access memory (RAM), to avoid slow I/O transfers. It uses the makeblastdb tool to create physical partitions of the existing database files, each of which is treated as a stand-alone database; combining these files yields virtual partitions that are stored on the Hadoop data nodes. The HBLAST MapReduce algorithm takes a mapping file as input, in which every line holds the name of a query file and all the files belonging to a single virtual partition. From this file it creates the map tasks, computes the result for each query on a single virtual partition, and passes the result to the reduce task along with the subject sequence id, the query sequence id, and the calculated score. The reduce task aggregates the scores by query sequence id and subject sequence id and writes them to a final file. HBLAST provides 2.5 times faster execution than CloudBLAST for large numbers of sequences.

Proposed work

As discussed, there are many variants of the BLAST algorithm, such as blastn, blastp, blastx, and tblastx. Here, the focus is only on blastp, which is used to search protein sequences in an existing protein database. Blastp has already been implemented with a GPU in the aforementioned article, which reduces the search time; it has also been implemented with MapReduce in HBLAST, which reduces time consumption and gives scalable database storage. Here, we present an integration of these two solutions into one: an implementation of the blastp algorithm with Hadoop and CUDA that takes advantage of both.

The algorithm for HCudaBLAST is designed in such a way that it can efficiently merge these two parallel computing techniques. There are many libraries and tools available which provide the way to merge Hadoop with GPU. Each tool works in a different way and they have their own pros and cons. Following are the existing ways:
  • JCUDA [11]: provides Java bindings for Nvidia CUDA and related libraries; it works only with Nvidia GPUs.

  • Java Aparapi [12]: converts Java bytecode to OpenCL at runtime; it works with Hadoop and supports different kinds of GPUs.

  • Native code: uses the Java Native Interface (JNI) to call GPU code from within Java code.

Of these three methods, Java Aparapi seems the most convenient for integrating Hadoop with GPUs, but it has many limitations and must convert code from Java to OpenCL at runtime, which costs extra time. The native-code approach using JNI is relatively faster and imposes no restrictions on the programming model; it can be developed for both CUDA and OpenCL. That is why this article uses the native-code method to implement the HCudaBLAST algorithm with CUDA.

The working of HCudaBLAST is straightforward. The input is a mapping file in which each line contains the name of a query file and the names of the database partition files, separated by commas. This mapping file is given to the running job and divided according to the number of available mappers. Each mapper takes its part of the file as input and processes one line at a time. In each iteration, the mapper processes the query file against the given database partition files and generates all the seeds for the query with a word length of 3. The mapper then passes these to the process running on the GPU, which calculates the final alignments; the mapper collects the result generated by the GPU process and emits the query id as the key and the score together with the subject id as the value to the reducer.
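The mapper-side steps above can be sketched in plain Java. The mapping-file line format and class name below are illustrative assumptions; note also that real blastp seed generation additionally expands each word into its high-scoring neighborhood words, which this simplification omits.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the mapper-side work described above: parse one line of
// the (hypothetical) mapping file, then enumerate every overlapping
// word of length 3 from the query — the seeds handed to the GPU stage.
public class SeedGenerator {
    static final int WORD_LENGTH = 3;

    // One mapping-file line: "queryFile,part1,part2,..." (comma separated).
    static String[] parseMappingLine(String line) {
        return line.split(",");
    }

    // Enumerate every overlapping word of WORD_LENGTH in the query.
    static List<String> generateSeeds(String query) {
        List<String> seeds = new ArrayList<>();
        for (int i = 0; i + WORD_LENGTH <= query.length(); i++)
            seeds.add(query.substring(i, i + WORD_LENGTH));
        return seeds;
    }

    public static void main(String[] args) {
        String[] fields = parseMappingLine("query1.fasta,db_part1,db_part2");
        System.out.println(fields[0]);              // query1.fasta
        System.out.println(generateSeeds("MKVLA")); // [MKV, KVL, VLA]
    }
}
```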

The GPU process takes as input all the subject sequences, the query sequence, the generated seeds, and the BLOSUM62 scoring matrix. It assigns each subject to a different thread, and each thread aligns the query sequence with its subject sequence for every available seed, extending the alignment of each seed until the score cut-off and expected-score cut-off are no longer met. Such a pair, the seed of the query sequence and the matching part of the subject sequence together with their score, is called a high-scoring segment pair (HSP). By distributing the work across all the processors available on the GPU, the calculation of the HSPs is parallelized; once all HSPs are computed, they are returned to the mapper. After the map phase, Hadoop sorts and shuffles the mapper output and merges the results by query id, so the reducer receives a query id as the key and lists of subject ids and their scores as values. The reducer collects all the scores belonging to the same subject id and query id, computes their sum, and emits the query id and subject id as the key with the summed score as the value, which is finally stored in an output file.
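The reducer step described above can be sketched in plain Java: for one query id, the reducer receives (subject id, score) pairs and sums the scores per subject. The class and identifier names are ours, not the actual HCudaBLAST classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reducer: aggregate HSP scores by subject id
// for a single query id, as Hadoop would do after sort-and-shuffle.
public class ScoreReducer {
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> hits) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> hit : hits)
            totals.merge(hit.getKey(), hit.getValue(), Integer::sum);
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> hits = List.of(
                Map.entry("subj1", 40), Map.entry("subj2", 25),
                Map.entry("subj1", 15));
        System.out.println(reduce(hits)); // {subj1=55, subj2=25}
    }
}
```

In the real job, Hadoop performs the grouping by key; only the summation inside `reduce` corresponds to user code.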

The tool used to partition the database is makeblastdb. The benefit of partitioning the database is that it can be properly distributed over the cluster and will easily fit in the available memory. The block size on Hadoop is 128 MB, so the maximum partition size of the database should be 128 MB, which reduces the time consumed by disk I/O. In a single mapper iteration, the search is performed over all the partitions residing on a single node. The formula for the expected score used to limit the alignment is eScore = K · m · n · e^(−λ·s), where n is the length of the query sequence, m is the length of the subject sequence, K and λ are statistical parameters estimated by fitting the distribution of local alignment scores, and s is the current score. The algorithm for HCudaBLAST is shown in Fig. 1.
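The expected-score formula above is straightforward to evaluate. In the sketch below, the K and λ values are approximately the commonly quoted ungapped BLOSUM62 parameters (K ≈ 0.13, λ ≈ 0.32) and are used only for illustration; they are not taken from the HCudaBLAST source.

```java
// Sketch of the expected-score cut-off formula quoted in the text:
// eScore = K * m * n * e^(-lambda * s), where m and n are the subject
// and query lengths and s is the current alignment score.
public class EScore {
    static double eScore(double K, double lambda, long m, long n, double s) {
        return K * m * n * Math.exp(-lambda * s);
    }

    public static void main(String[] args) {
        // Illustrative ungapped-BLOSUM62-like parameters: K ~ 0.13, lambda ~ 0.32.
        double e = eScore(0.13, 0.32, 300, 1_000_000, 50.0);
        System.out.printf("E = %.4g%n", e);
    }
}
```

A higher current score s gives an exponentially smaller expected score, which is why extension stops once the cut-offs are met.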

Experimental setup

The HCudaBLAST algorithm is evaluated on a cluster of machines set up with the Hadoop framework, each with Nvidia CUDA capability. Two machines in the cluster have an Intel Xeon E5-2630 processor with 6 cores and 8 GB RAM and carry two CUDA devices: a Tesla C2075 with 14 multiprocessors and compute capability 2.0, and a Quadro 2000 with four multiprocessors and compute capability 2.1. The other three machines have an Intel Xeon E31245 processor with 4 cores and 8 GB RAM and carry one CUDA device, a Quadro 600 with compute capability 2.1. All the physical machines in the cluster were connected through a single switch on a local LAN. The cluster runs Hadoop version 2.7 and CUDA toolkit 7.5 on each machine. One machine acts as the master node, running the NameNode and ResourceManager, while the other four machines each run a DataNode and NodeManager. All the code run on Hadoop is written in the Java programming language, and the native code running on the GPUs is written in C and called from Java through the Java Native Interface (JNI).

The database comprises approximately six million protein sequences taken from the NCBI web site [13]. Query sequences of different sizes have also been taken from the same web site to perform the searches. The experiment was performed in two parts, one with two million sequences and the other with six million sequences, using different query sizes and different numbers of database partitions to properly analyze the time taken by the algorithm.

Tables 1 and 2 show the results of this experiment and the time taken by both the Hadoop implementation of BLAST and HCudaBLAST. HCudaBLAST shows a significant improvement in the time taken across different numbers of subject sequences and different query sizes.
Fig. 1

Algorithm for HCudaBLAST


After seeing Tables 1 and 2, we can conclude the following:
  • BLAST implementations on Hadoop and on CUDA can indeed be combined.

  • Our experimental setup consisted of differently configured machines in the cluster. It remains to be seen whether a cluster of identically configured machines would give even better results than the ones we obtained.

  • Our machines also have multiple CPU cores, so part of the algorithm could be made to use a CPU multi-threading environment to gain better results.

  • HCudaBLAST performs almost 1.5 times better than the generic BLAST implementation on Hadoop alone.

Table 1

Performance of HCudaBLAST compared to generic BLAST algorithm on Hadoop with 2M sequences (BoH: BLAST on Hadoop, HCB: HCudaBLAST)

Query length

Database hits

BoH runtime (min)

HCB runtime (min)

Speed up


Table 2

Performance of HCudaBLAST compared to generic BLAST algorithm on Hadoop with 6M sequences (BoH: BLAST on Hadoop, HCB: HCudaBLAST)

Query length

Database hits

BoH runtime (min)

HCB runtime (min)

Speed up


Authors’ contributions

Dr. NK carried out the molecular genetic studies, participated in the sequence alignment and drafted the manuscript. FK participated in the sequence alignment, the design of the study and performed the statistical analysis. AK conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.


We thank our colleagues from Maulana Azad National Institute of Technology, Bhopal, who provided insight and expertise that greatly assisted the research. We also thank Dr. Akhtar Rasool, Assistant Professor; Mr. Amit Swami, Ph.D. Scholar; Mr. Gourav Hajela, Ph.D. Scholar; and many others for their insights in the form of reviews and immensely helpful comments on an earlier version of the manuscript, although any errors are our own and should not tarnish the reputations of these esteemed persons.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

The data used in this article is publicly available in NCBI repository of protein databases and the link for the same is mentioned in the references section.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.


This research received no funding.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

Maulana Azad National Institute of Technology


  1. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
  3. Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435–41.
  4. Sleator RD, Shortall C, Hill C. Metagenomics. Lett Appl Microbiol. 2008;47(5):361–6.
  5. O’Driscoll A, Belogrudov V, Carroll J, Kropp K, Walsh P, Ghazal P, Sleator RD. HBLAST: parallelised sequence similarity—a Hadoop MapReducable basic local alignment search tool. J Biomed Inform. 2015;54:58–64.
  6. Vouzis PD, Sahinidis NV. GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics. 2011;27(2):182–8.
  7. Manavski SA, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith–Waterman sequence alignment. BMC Bioinform. 2008;9(2):1.
  8. Shafer J, Rixner S, Cox AL. The Hadoop distributed filesystem: balancing portability and performance. In: IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS); 2010. p. 122–33.
  9. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.
  10. Matsunaga A, Tsugawa M, Fortes J. CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications. In: IEEE Fourth International Conference on eScience; 2008. p. 222–9.
  11. Yan Y, Grossman M, Sarkar V. JCUDA: a programmer-friendly interface for accelerating Java programs with CUDA. In: Euro-Par 2009 Parallel Processing; 2009. p. 887–99.
  12. Grossman M, Breternitz M, Sarkar V. HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL. In: IEEE 27th International Parallel and Distributed Processing Symposium Workshops & Ph.D. Forum (IPDPSW); 2013. p. 1918–27.
  13. NCBI Database. Accessed 29 Nov 2016.


© The Author(s) 2017