
Proposal of Smith-Waterman algorithm on FPGA to accelerate the forward and backtracking steps

  • Fabio F. de Oliveira ,

    Contributed equally to this work with: Fabio F. de Oliveira, Leonardo A. Dias, Marcelo A. C. Fernandes

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Software, Writing – original draft, Writing – review & editing

    Affiliations Laboratory of Machine Learning and Intelligent Instrumentation, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, Brazil, Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil

  • Leonardo A. Dias ,

    Contributed equally to this work with: Fabio F. de Oliveira, Leonardo A. Dias, Marcelo A. C. Fernandes

    Roles Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Centre for Cyber Security and Privacy, School of Computer Science, University of Birmingham, Birmingham, United Kingdom

  • Marcelo A. C. Fernandes

    Contributed equally to this work with: Fabio F. de Oliveira, Leonardo A. Dias, Marcelo A. C. Fernandes

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    mfernandes@dca.ufrn.br

    Affiliations Laboratory of Machine Learning and Intelligent Instrumentation, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, Brazil, Department of Computer and Automation Engineering, Federal University of Rio Grande do Norte, Natal, Brazil, Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil

Abstract

In bioinformatics, alignment is an essential technique for finding similarities between biological sequences. Usually, the alignment is performed with the Smith-Waterman (SW) algorithm, a well-known sequence alignment technique of high-level precision based on dynamic programming. However, given the massive data volume in biological databases and their continuous exponential increase, high-speed data processing is necessary. Therefore, this work proposes a parallel hardware design for the SW algorithm with a systolic array structure to accelerate the forward and backtracking steps. For this purpose, the architecture calculates and stores the paths in the forward stage for pre-organizing the alignment, which reduces the complexity of the backtracking stage. The backtracking starts from the maximum score position in the matrix and generates the optimal SW sequence alignment path. The architecture was validated on a Field-Programmable Gate Array (FPGA), and synthesis analyses have shown that the proposed design reaches up to 79.5 Giga Cell Updates per Second (GCUPS).

1 Introduction

In bioinformatics, the analysis can be divided into three parts, called primary, secondary, and tertiary analysis [1, 2]. The primary analysis is responsible for generating genomic data from biological material: the sequencing machines create raw genomic data (or raw data), composed of several genome reads. The secondary analysis involves read alignment and quality-based trimming; at the end of this step, a whole genome is assembled. Finally, the tertiary analysis can be characterized as interpreting the results and extracting meaningful information from the data. In this last step, many algorithms and techniques can be applied, and many applications are built on these analyses, ranging from genome characterization to vaccine or drug development [2].

A large amount of raw data has been generated in recent years due to the replacement of Sanger sequencing by Next-Generation Sequencing (NGS), also called High-Throughput Sequencing (HTS) [3, 4], where NGS data analysis includes the three analysis categories mentioned above. Each sequencing machine can generate about 7 tera base pairs (bp) per hour (Tbp/h) [5]. Intense sequencing of raw data from organisms and worldwide data sharing are critical to monitoring major SARS-CoV-2 viral mutations and to viral control [6]. The disease caused by the SARS-CoV-2 virus has spread worldwide and has been declared a pandemic by the World Health Organization [7, 8].

After sequencing the reads, alignment methods can be applied to map and determine the evolutionary line of the targeted organism, such as its phylogeny. As a result, it is possible to understand the sample's action mechanics by comparing it with cataloged samples in existing databases [2, 9]. The most used meta-heuristic alignment method is the Basic Local Alignment Search Tool (BLAST), due to its fast processing speed and lower memory usage compared to deterministic alignment algorithms [10].

However, unlike meta-heuristics, deterministic alignment methods offer the optimal alignment for a given input sequence instead of an approximate solution. The main deterministic methods are the Needleman-Wunsch (NW) and Smith-Waterman (SW) algorithms for global and local alignment, respectively [11, 12]. Nonetheless, a significant disadvantage of these algorithms is their slow processing speed and high memory usage due to their computational complexity. For example, SARS-CoV-2 genomes commonly vary from 28 kbp to 31 kbp in size. Thus, performing thousands of large-size sequence alignments becomes a real challenge for extracting information from the raw data.

Thus, it is essential that the processing of algorithms associated with bioinformatics covers three critical requirements: high processing speed (high throughput), ultra-low latency, and low power [13–15]. Bioinformatics analysis algorithms are critically dependent on the computational infrastructure to meet these requirements. Three categories of infrastructure are used: Central Processing Units (CPUs) [16], Graphics Processing Units (GPUs) [17, 18], and Custom Hardware Architectures (CHAs) [19–21].

Genomic analysis solutions associated with High-Performance Computing (HPC), i.e., the first (CPUs) and second (GPUs) categories of computational infrastructure, use software-only systems implemented on CPUs and GPUs. According to the results presented in [22–24], these software-only approaches struggle to keep up with the growing computational demands of genomic analysis, given the barriers to reducing large-volume latency and power consumption using only CPUs and GPUs. In addition, as the number of nodes grows to handle increasing amounts of data, performance does not scale linearly [16, 25–27]. The third category (CHA) has emerged as a promising alternative to satisfy the high-throughput, ultra-low-latency, and low-power requirements [28–33].

Therefore, this work presents a parallel FPGA design with a systolic array structure to accelerate both the forward and backtracking stages of the SW algorithm. The main contributions are a high-speed data processing implementation and low memory usage that allows for high scalability. According to [34], the systolic array is a class of parallel computing architecture designed for dense linear algebraic calculations, proposed by [28]. Its hardware implementation usually uses a pipeline structure, where data is propagated between Processing Elements (PEs). Its main advantage is reducing the number of memory accesses throughout the data flow. Hence, systolic arrays simplify the architecture and improve the system's operating frequency [35].

To overcome the low processing speed bottleneck while maintaining the optimal alignment of deterministic algorithms, parallel hardware implementations of the SW algorithm have been proposed in the literature. The main platforms used are Field-Programmable Gate Arrays (FPGAs), CPUs, and GPUs. FPGAs are widely known for their flexibility for parallelization and low power consumption. An FPGA is a matrix of logic blocks that allows designing different circuits, such as processors, logic circuits, and even algorithm implementations [31]. FPGA platforms can be categorized as the third category of computational infrastructure in bioinformatics, as they are CHAs. Also, the logic blocks within the FPGA are independent, allowing operations to be carried out in parallel and in only one clock cycle, unlike CPUs, which operate sequentially based on instructions, and GPUs, which require constant memory accesses.

1.1 Related works

This subsection briefly discusses hardware-based approaches for the SW algorithm available in the literature. Usually, the SW algorithm is used for protein and DNA sequence alignment, but the parallelization strategies for the two are quite different. Many hardware implementations targeting protein alignment were developed on supercomputers, CPUs, GPUs, and FPGAs, as can be seen in [36–40]. Due to the extensive number of previous works in the field, we consider only the DNA alignment implementations, which in turn are commonly based on Resistive Content Addressable Memories (ReCAMs) [22], Application-Specific Integrated Circuits (ASICs) [41], and FPGAs [23, 24, 39, 42–53].

GPUs are well known for their high degree of parallelism and computing intensity. However, high-performance GPUs have a high cost and, broadly speaking, significant computing latency and lower energy efficiency than custom hardware, although they are more energy efficient than CPUs, as seen in the results of [22, 23, 40]. The high computing latency is due to the high number of cores and the small cache memory available to control these cores. In contrast to GPUs, FPGAs are customizable according to the user's needs, achieving better computing performance and lower latency [15, 54, 55]. However, FPGA hardware development is usually complex and time-consuming.

Unlike the conventional platforms previously mentioned, the SW algorithm has also been implemented on ReCAMs, as can be seen in [22]. ReCAM is a storage accelerator system that allows millions of processing units (PUs) to be deployed over multiple silicon arrays. In [22], the implementation was used to compare homologous chromosomes between humans (GRCh37) and chimpanzees (panTro4), and the only SW step performed was building the score matrix. As a result, their proposal achieved 5,300 GCUPS, a 4.8× speedup over the GPU performance. Besides, it also had 1.7× better energy efficiency than an FPGA implementation.

In [45], a hardware/software co-design was implemented on FPGA to reduce the execution time of short-sequence alignment during genome sequence analysis. The analysis was performed using the Shouji method, a highly parallel and accurate pre-alignment filter that reduces the need for computationally costly dynamic programming algorithms. The FPGA was used to boost the algorithm's performance. As a result, integrating the Shouji and aligner methods reduced the total alignment execution time by up to 18.8×.

In [48], a systolic-array-based architecture is presented as a DNA sequencer, using the SW algorithm with an affine gap penalty score. According to [48], the SW algorithm was implemented on a Xilinx Virtex-6 FPGA, reaching 465 Giga Cell Updates per Second (GCUPS) and reducing the area occupation by 90% compared to other architectures.

In [50], an FPGA runtime accelerator is presented to align data information sequences. The authors' implementation is based on the seed-and-extend model of Bowtie2, achieving a similar alignment rate while mapping reads ≈2× faster. Meanwhile, [46] proposes an FPGA implementation of the SW algorithm, called SWIFOLD, to replace the GPU used in [40]. The FPGA implementation is based on OpenCL and is used for long DNA sequence alignment. The SWIFOLD approach for accelerating the SW alignment reached an average of 125 GCUPS.

In [41], an ASIC design for traceback recording with penalty scoring is proposed. The design is implemented in a TSMC 40 nm technology, and the aligner strategy can speed up pairwise alignment by 71× compared to a CPU.

Meanwhile, [47] presents an FPGA approach to meet the processing requirements of the alignment operation. The paper introduces a register-file concept to reduce run-time storage, and it does not require any sorting or comparison operations to prepare for the final sequence alignment. As a result, a performance of 128 GCUPS was achieved using 256 PEs.

In [56], an FPGA hardware implementation of the SW and NW algorithms is presented. The performance and area occupation are evaluated for different hardware designs. Besides, a convolutional neural network model is introduced to implement the NW algorithm, achieving 98.3% accuracy.

In [23], a heterogeneous FPGA architecture for sequence alignment is proposed. Unlike most works in the literature, their implementation aims to accelerate the entire SW algorithm, including the backtracking process. For this purpose, the architecture can process long strings of data based on parallelism and partitioning strategies; the backtracking process was performed by dividing the similarity matrix into equal parts, with the search starting from the lower-right sub-matrix. The tests were performed with 512 Processing Elements (PEs), reaching 76.8 GCUPS at 150 MHz and 105.9 GCUPS (with external memory) at 200 MHz. As a result, a speedup of 3.6× to 25.2× was achieved over other SW designs implemented on FPGAs and GPUs. Besides, it reached a 26% reduction in power consumption compared to the GPU implementation.

Similarly, more FPGA approaches using systolic arrays for the NW and SW algorithms with backtracking have been proposed, such as [39, 57]. In [39], a VHDL SW implementation using Dynamic Programming (DP) with approximate matching for two different strategies was proposed. It achieved 23.5 GCUPS, with speedups of 150× to 400× compared to a 2004-era PC. Meanwhile, in [57], the implementation was based on PEs performing elementary calculations and a diagonal backtracking search, also developed in VHDL. Comparisons were made between the linear and affine strategies, achieving 10.5 GCUPS.

In [44], the SW forward and backtracking processes were implemented on an FPGA. The Qnet structure was adopted for communicating with the FPGA, reaching 25.6 GCUPS, a speedup of 300× compared to a desktop computer.

Therefore, it can be noted from the literature that the key points for a high-performance SW implementation on FPGA are the operating frequency and the number of PEs, which in turn are associated with the hardware capacity and the design's critical path. Thus, we present an FPGA implementation of the SW algorithm using systolic arrays, as in [23, 39, 44, 57].

Our approach performs both the forward and backtracking stages of the algorithm. Unlike the approaches in the literature, we obtain the alignment path distances and the maximum score during the forward stage, reducing the complexity of the backtracking stage. Memories are used to propagate the distances and the maximum score, allowing the backtracking step to follow the path directly. Thus, our architecture achieves good performance (short critical path), reduced memory usage, theoretically high scalability, and avoids memory access overlap latency, even though it implements both stages of the SW algorithm. It is essential to mention that our proposed hardware architecture can perform the alignment of sequences of any length. However, the resources available in the target FPGA are a limiting factor for the size of the score matrix, as shown in [23, 47, 58]. Nonetheless, we provide a proof of concept and an actual implementation on the Virtex-6 FPGA using a synthetic dataset.

2 Smith-Waterman algorithm

Smith and Waterman originally proposed the SW algorithm in 1981 to perform local sequence alignment of nucleotides and proteins in the biological field [12]. The sequence alignment of the SW algorithm includes the forward and backtracking stages, which are driven by the calculation of the alignment similarity score. The alignment is performed on two input sequences, called the query sequence, q, and the dataset sequence, s. The query sequence can be expressed as

$$\mathbf{q} = [q_0, q_1, \ldots, q_{N-1}], \tag{1}$$

where qj is the j-th nucleotide or amino acid and N is the length of the query sequence. The dataset sequence can be expressed as

$$\mathbf{s} = [s_0, s_1, \ldots, s_{M-1}], \tag{2}$$

where si is the i-th nucleotide or amino acid, and M is the dataset sequence length. Therefore, the SW algorithm is calculated iteratively over two dimensions, and it has a computational complexity of O(M × N).

The forward stage calculates the scoring matrix, H, a two-dimensional array that can only take values greater than or equal to 0 (i.e., H(i, j) ≥ 0). This matrix is generated by comparing the elements of the sequences q and s. Usually, H is generated using DP, and it is initialized with zeros in the first row and column. Subsequently, the DP process is performed to calculate the sequence scores. Based on the works presented in [23, 36, 59], the recurrence relationship can be defined as

$$H(i,j) = \max\{0,\; H(i-1,j-1) + P(s_i, q_j),\; E(i,j),\; F(i,j)\}, \tag{3}$$

$$E(i,j) = \max\{H(i,j-1) - \rho,\; E(i,j-1) - \sigma\}, \qquad F(i,j) = \max\{H(i-1,j) - \rho,\; F(i-1,j) - \sigma\},$$

where 1 ≤ i ≤ M and 1 ≤ j ≤ N, P is the score matrix used for obtaining the similarity score between si and qj, E and F are two assisting matrices for calculating matrix H, ρ is the gap opening penalty, and σ is the gap extension penalty. In the particular case of ρ = σ, a linear gap penalty model is obtained, opening and extending a gap with cost γ. P is also called a substitution matrix; its simplest version assigns the match value to the diagonal and the mismatch value to the rest of the matrix. When all element calculations are performed, this recurrence yields the M × N matrix H. Therefore, H(i, j) is the maximum alignment score of the two sub-sequences of s and q ending at positions i and j. The initialization condition is

$$H(i, 0) = H(0, j) = 0, \quad 0 \le i \le M,\; 0 \le j \le N. \tag{4}$$

The position of the maximum score value of H(i, j) found in the forward stage marks the end of the aligned sub-sequence. To compute the recurrence, the previously computed neighborhood values of the analyzed element are required, i.e., the values at the diagonal, horizontal, and vertical positions, as illustrated in Fig 1. As can be observed, the score w can be found based on its neighborhood (x, y, v), which is H(i − 1, j − 1), H(i − 1, j), and H(i, j − 1), respectively. This windowing step occurs throughout the process of determining all scores in H.

Fig 1. The direction of the score computation in the matrix during the SW forward stage.

To determine a score, such as w, the neighborhood values (x, y, and v) have to be known. The green-colored cells indicate already computed values, while the yellow cells indicate values yet to be calculated.

https://doi.org/10.1371/journal.pone.0254736.g001

As shown in Fig 1, the neighborhood values x, y, and v must be known to determine the value of w (i.e., H(i, j)). For this purpose, those values are defined based on the sequences q and s. Thereby, the score w is determined as

$$w = \max\{0,\; x + (\alpha \mid \beta),\; y + \gamma,\; v + \gamma\}, \tag{5}$$

where γ, α, and β represent the linear gap, a match, and a mismatch, respectively, and (α | β) denotes α on a match and β on a mismatch. A gap is a penalty that causes an empty element in the sequence (represented by a dash symbol) while the other sequence continues; it can occur in either the query or the database sequence. Eq 5 is equivalent to Eq 3, where x + (α | β) = H(i − 1, j − 1) + P(si, qj), y + γ = F(i, j), and v + γ = E(i, j). Finally, when fully populated, the H matrix contains the score and path information.

The backtracking stage starts after all scores in the H matrix have been determined, i.e., after calculating the scores of all cells up to H(M, N). The backtracking begins at the cell with the highest value in the H matrix (maximum score) and traces back to the next position based on the highest neighborhood value, according to Eq 5, which can lie in the diagonal, horizontal, or vertical direction. This iterative process repeats until it reaches the limit value, usually set to a score of 0. A directional flag indicates the path at each step. Finally, the backtracking path determines the best local alignment: the diagonal direction indicates a match in the alignment, while the horizontal and vertical directions indicate gaps, represented by dashes in the s and q sequences, respectively.
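To make the two stages concrete, the following is a minimal software sketch of Eq 5 and the backtracking rule above, assuming the linear gap model and the direction encoding used later in Section 3 (2 = diagonal, 1 = vertical gap, 3 = horizontal gap); the function name and the default parameter values are illustrative, not part of the proposed hardware.

```python
# Minimal software model of the SW forward and backtracking stages (Eq 5).
# Direction codes follow Section 3: 0 = stop, 1 = vertical gap (dash in q),
# 2 = diagonal (match/mismatch), 3 = horizontal gap (dash in s).
def smith_waterman(q, s, alpha=5, beta=-5, gamma=-1):
    N, M = len(q), len(s)
    H = [[0] * (N + 1) for _ in range(M + 1)]  # scores, zero-initialized (Eq 4)
    D = [[0] * (N + 1) for _ in range(M + 1)]  # direction flags
    best, pos = 0, (0, 0)

    # Forward stage: fill H and D (Eq 5) and track the maximum score.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            x = H[i - 1][j - 1] + (alpha if s[i - 1] == q[j - 1] else beta)
            y = H[i - 1][j] + gamma            # vertical move
            v = H[i][j - 1] + gamma            # horizontal move
            w = max(0, x, y, v)
            H[i][j] = w
            D[i][j] = 0 if w == 0 else (2 if w == x else (1 if w == y else 3))
            if w > best:
                best, pos = w, (i, j)

    # Backtracking stage: walk back from the maximum score until a 0 cell.
    i, j = pos
    aq, as_ = [], []
    while H[i][j] > 0:
        d = D[i][j]
        if d == 2:
            aq.append(q[j - 1]); as_.append(s[i - 1]); i, j = i - 1, j - 1
        elif d == 1:
            aq.append('-'); as_.append(s[i - 1]); i -= 1
        else:
            aq.append(q[j - 1]); as_.append('-'); j -= 1
    return best, ''.join(reversed(aq)), ''.join(reversed(as_))

print(smith_waterman("ACACACTA", "AGCACACA"))
```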

3 Implementation description

The hardware architecture for the SW algorithm proposed in this work was developed using systolic arrays; it takes two DNA sequences as input and increases the processing speed of the local sequence alignment. An overview of the systolic array structure of the proposal for N PEs is shown in Fig 2. Each PE is divided into 3 modules, corresponding to the forward stage, the storage process, and the backtracking stage described in Section 2, illustrated in blue, green, and yellow, respectively. The forward stage module is named the Matrix Score Module (MSM), the storage process module is called the Memory Module (MM), and the backtracking stage module is called the Backtracking Stage (BS).

Fig 2. General architecture for the SW algorithm.

The forward stage (MSM) is represented by the blue block, the backtracking stage (BS) by the yellow block, and the Memory (MM) by the green block. Only external signals are displayed, i.e., the q and s signals.

https://doi.org/10.1371/journal.pone.0254736.g002

The labeled signals shown in Fig 2 are generated outside the modules, while the non-labeled ones are generated by computations inside the modules and are detailed throughout this section. The sequences q and s, defined according to Eqs 1 and 2, are external discrete signals used as inputs of the SW algorithm. Furthermore, each signal in the sequences represents one of the four DNA nucleotides, i.e., A, G, T, or C (the design also accepts twenty distinct levels for amino acids or another set of symbols). The proposed design supports any sequence set, as in any SW algorithm; however, for an efficient alignment, it is necessary to adopt a suitable scoring matrix that models the frequency of each possible symbol that can occur in the sequences.

Initially, the circuit starts when the MSM modules propagate the q and s signals. As seen in Fig 2, each k-th element of the q and s sequences is shifted to each MSM output to shorten and stabilize the critical path, as well as to allow the scores to be computed synchronously, preserving the systolic array structure. Afterward, the MSM computes the score according to Eq 3 and propagates the sequence elements to the next MSM; the computed results are also sent to the respective MM in their order of entry. During this process, the MM operates exclusively in write mode until the last computation between the two sequences is reached.

The forward stage is completed after fully computing the scores of the H matrix, and the last MSM enables the backtracking process. Consequently, the MM switches to read mode, and the BS reads the data computed by its respective MSM. The alignment starts from the calculations performed in the MSM: beginning at the PE defined in the forward stage, the process starts and ends according to the definitions of the SW algorithm.

Fig 3 shows the block design that represents each PE of the systolic array, with a detailed description of the signals between the modules within one PE. As can be observed, besides the two input sequences to be compared, q and s, the MSM also receives an enable signal, en. After computing the score between each k-th element of the two sequences (i.e., an element of the H matrix), the MSM outputs the following signals to the next PE: the calculated score, Scj; the maximum score, MaxVal, and its position, AddrRAMij; the PE index; along with the input signals q, s, and en, shifted in time. In addition, the MSM also outputs signals to the MM, namely the calculated path direction, Direction, and the storage address of that path, wAddrDir.

Fig 3. Architecture of each PE in the systolic array.

The forward stage is represented by the blue block, the backtracking stage by the yellow block, and the Memory by the green block.

https://doi.org/10.1371/journal.pone.0254736.g003

Subsequently, after the H matrix and, consequently, the D matrix are fully populated, the forward stage finishes by enabling the Traceback signal, which in turn starts the BS. Firstly, the BS sets the signal BTStart to 1, indicating the start of the backtracking process. The mRAMij signals are then propagated back until they reach the BS holding the maximum score, which is identified by the index signal. Once this location matches, the btcontrol signal is changed to allow the MM memory to be read. Thus, the BS receives the path value from the MM on signal dj after sending the memory address on the rAddrDirj signal. The dj value allows the BS to calculate the next requested address and propagate it to the next module through the path(j) signal, which represents the memory address of the requested path in the MM. Lastly, the alignment value enters valDir, and the process continues until the complete alignment is reached. All modules are detailed in the following subsections, and all signals used in this section are listed in Table 1.

Table 1. Description of signals and the algorithm stage they are used.

The forward stage is represented by F, the storage stage by S, and the backtracking stage by B. They are shown in Figs 4, 7 and 8.

https://doi.org/10.1371/journal.pone.0254736.t001

3.1 Forward approach

Firstly, based on the divide-and-conquer principle for solving computational problems, we propose a matrix used to store only the values of the recursive path, called the D matrix. The D matrix is not widely used in the SW literature; however, it is important for reaching a solution at a lower programming level. Besides, a matrix holding two different types of information, such as the H matrix, increases the hardware design complexity. The D matrix needs to store only 4 value levels: 0, 1, 2, and 3. Each element of the D matrix therefore needs only 2 bits, delivering a more economical storage process compared to H, whose elements generally need more than 2 bits each.
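As a software illustration of this economy (our own sketch, not part of the hardware), four 2-bit direction codes can be packed into a single byte:

```python
# Pack 2-bit direction codes (values 0..3) four per byte, mirroring the
# 2-bit-per-element storage that the D matrix allows.
def pack_directions(dirs):
    packed = bytearray((len(dirs) + 3) // 4)
    for k, d in enumerate(dirs):
        packed[k // 4] |= (d & 0b11) << (2 * (k % 4))
    return bytes(packed)

def unpack_direction(packed, k):
    return (packed[k // 4] >> (2 * (k % 4))) & 0b11

packed = pack_directions([2, 1, 3, 0, 2, 2])
assert unpack_direction(packed, 4) == 2  # fifth code recovered intact
```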

As previously mentioned, the alignment process is performed based on the query and dataset input sequences, q and s, respectively. They can have different sizes, represented by N and M, which define the dimensions of the matrices H and D. The Matrix Score Module (MSM) calculates the scores and distances in the columns of matrices H and D in parallel.

The systolic array structure developed for the matrices is composed of N PEs; therefore, for each j-th element in q, there is a j-th PE. The approach is based on dividing the construction of the H score matrix, expressed by

$$\mathbf{H} = [\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_{N-1}], \tag{6}$$

and finding the best path through which the D matrix returns the correct sequence alignment, which in turn is equivalent to the directional flags that determine the alignment path. Moreover, each PEj (which represents a column of the matrix H) processes the i-th element s(i), with i varying from 0 to M − 1, according to

$$\mathbf{h}_j = [H(0, j), H(1, j), \ldots, H(M-1, j)]^{T}. \tag{7}$$

The number of MSM submodules corresponds to the number of elements in q, i.e., N, as can be observed in Fig 4. Therefore, H is formed by N columns, according to Eq 6. Besides, the MSM also calculates the path, the maximum score value, and its position, which are subsequently stored in the Memory Module (MM).

Fig 4. Hardware representation of the H score matrix on the forward stage.

The modules are generated from 0 to N − 1.

https://doi.org/10.1371/journal.pone.0254736.g004

The SW algorithm in this work is initialized by the en(k) signal, which enables the memory components in the MSM and MM modules to allocate the two sequences q(k) and s(k). The en(k) signal is a train of pulses of value 1 with length equal to that of the s sequence. Thus, the sequences are transmitted at each sampling time to the forward module. The signals are received in the MSM, and the respective q(k) is allocated according to its position, while s(k) is propagated through the MSMs based on the internal counter within each module. The counters within each MSM module are activated by each pulse of the en(k) signal.

Each k-th element of q is compared to all elements in s iteratively, i.e., s traverses the array, going from the first PE to the last one. If the values are equal, the Match constant is propagated; otherwise, the Mismatch value is propagated. Match corresponds to a reward for similarity, while Mismatch is a penalty for inequality between values. Afterward, the addition block sums the values according to

$$(\alpha \mid \beta) = \begin{cases} \alpha, & \text{if } q_j = s_i \\ \beta, & \text{otherwise,} \end{cases} \tag{8}$$

where α and β are arbitrary values that correspond to the match and mismatch values, respectively.

Subsequently, the score value, Scj−1(i − 1), and the correspondence value, (α | β), are added to define one portion of gj(i). The Scj−1(i − 1) value is equivalent to the H(i − 1, j − 1) value (i.e., gj−1(i − 1)). The values of Scj(i), MaxVal(−1), AddrRAMij(−1), and index(−1) are initialized with 0. At the same time, Scj−1(i), which is the score value from the previous block, is received and combined with the Gap value. In addition, the score computed by this block at the previous time step, Scj(i − 1), is also combined with the Gap. This completes the computation of gj(i), as seen in Eqs 5 and 9.

Fig 5 shows the submodules that constitute each MSM module. The three blocks in pink perform the addition and subtraction operations, representing the SW relations used to generate the matrix elements. Thereby, the process of choosing the maximum value among the calculated scores is carried out based on Eq 5 as follows:

$$g_j(i) = \max\{0,\; Sc_{j-1}(i-1) + (\alpha \mid \beta),\; Sc_{j-1}(i) + \gamma,\; Sc_j(i-1) + \gamma\}, \tag{9}$$

where γ is an arbitrary value that represents the chosen linear gap value. This expression is equivalent to Eq 3.

The outputs of the pink blocks, called opr, are propagated to the next submodule for choosing the maximum score and distance path, as shown in Fig 5. This submodule is built with a set of multiplexers and relational circuits that find the maximum score value and the coded distance of the path by comparing the opr signals, as seen in Fig 6.

Fig 5. Submodules that constitute a Matrix Score Module.

The representation of the circuit and signals is only related to the forward stage.

https://doi.org/10.1371/journal.pone.0254736.g005

Selecting path distances is based on a simple encoding of three levels representing the alignment action to be adopted: 2, 1, and 3. Levels 2, 1, and 3 represent a match, a gap in sequence q, and a gap in sequence s, respectively, as described in Section 2. The encoding of directions is performed in the forward step, as illustrated in Fig 6. During this process, the same signals used to calculate the H score matrix are needed, i.e., oprj−1(i − 1), oprj−1(i), and oprj(i − 1), as seen in Fig 5. These values are compared in relational circuits and subsequently chosen according to the criteria of the SW algorithm, as seen in Fig 6.

Fig 6. Circuits that constitute the submodule for finding the maximum score and distance path within an MSM.

The relational circuits are represented in purple and the multiplexers in yellow.

https://doi.org/10.1371/journal.pone.0254736.g006

To demonstrate how the path coding process is performed, the information in Fig 1 is used. In Fig 1, four variables are distributed in an H score matrix. The variables x = H(i − 1, j − 1), y = H(i − 1, j), and v = H(i, j − 1) are known values, while w = H(i, j) is the observed cell for determining a generic path, with x, y, and v as its neighborhood. An integer value dj, corresponding to the address of w, is assigned according to the maximum value determined in the neighborhood:

$$d_j(i) = \begin{cases} 2, & \text{if } x + (\alpha \mid \beta) > y + \gamma \ \text{and}\ x + (\alpha \mid \beta) > v + \gamma \\ 1, & \text{else if } y + \gamma > v + \gamma \\ 3, & \text{otherwise,} \end{cases} \tag{10}$$

where 1, 2, and 3 denote the vertical, diagonal, and horizontal paths, respectively. Eq 10 is equivalent to the circuit implementation illustrated in Fig 6, where x + (α | β) = oprj−1(i − 1), y + γ = oprj(i − 1), and v + γ = oprj−1(i). Besides, (α | β) = α for a match and (α | β) = β for a mismatch.
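In software terms, the selection performed by Fig 6 over the opr signals can be sketched as follows (a behavioural model under our own naming; the tie-breaking order follows Algorithm 1 below):

```python
# Behavioural model of the "maximum score and direction" submodule
# (Eqs 9 and 10): opr_diag = x + (alpha|beta), opr_vert = y + gamma,
# opr_horiz = v + gamma.
def select_score_and_direction(opr_diag, opr_vert, opr_horiz):
    if opr_diag > opr_vert and opr_diag > opr_horiz:
        score, direction = opr_diag, 2   # diagonal path
    elif opr_vert > opr_horiz:
        score, direction = opr_vert, 1   # vertical path
    else:
        score, direction = opr_horiz, 3  # horizontal path
    return max(score, 0), direction      # scores clamped at 0, as in Eq 5
```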

Algorithm 1: SW forward stage pseudo-code based on the structure of this proposal

Input: query sequence q

Input: dataset sequence s

Output: distance path matrix D

Output: row position of maximum score posMi

Output: column position of maximum score posMj

//length query sequence N, length dataset sequence M, match value α, mismatch value β and linear gap value γ;

for k = 0 to M × N step 1 do

 Initialize the DP matrix H and D with zeros;

end

//Forward Stage;

for j = 0 to N − 1 do

for i = 0 to M − 1 do

  if q(j) = s(i) then

   sel ← α;

  else

   sel ← β;

  end

  //H(i + 1, j + 1) computation;

  x = H(i, j) + sel; y = H(i, j + 1) + γ; v = H(i + 1, j) + γ;

  score ← 0, direction ← 0;

  if x > y ∧ x > v then

   score ← x; direction ← 2;

  else

   if y > v then

    score ← y; direction ← 1;

   else

    score ← v; direction ← 3;

   end

  end

  // Store the calculated score in matrix H and the direction in matrix D;

  H(i + 1, j + 1) = score; D(i + 1, j + 1) = direction;

  // Check whether this is the highest calculated score;

  if maxVal < score then

   maxVal ← score; posMi ← i + 1; posMj ← j + 1;

  end

end

end

return D, posMi, posMj;

After selecting the score and direction, the maximum score is chosen using a logic of multiplexers and relational blocks. A counter, called cntR, determines the number of times the score and direction selection has been carried out, i.e., which row of the H matrix the process is on. This is necessary to determine the AddrRAMi address. At the beginning of MSM processing, index(j − 1) is incremented by 1, once per MSM, becoming index(j) and determining the address of that MSM. To determine MaxVal, the previous value is compared with the current computed score; if it is smaller, the current score becomes MaxVal, AddrRAMj = index(j), and the respective row value becomes AddrRAMi. Note that the AddrRAMij signals correspond to the location of the maximum score value.

In parallel with the process of determining the maximum score value, the directions are stored. Thus, the Direction output of the submodule is paired with the wAddrDir value, which comes from the H matrix row being calculated at that moment, allowing in-order writes to the RAM according to the respective positions of the H matrix (i.e., the same positions as in the D matrix).

Finally, according to the systolic structure, after the MSM processing is over, the signals are sent in parallel to the next MSM. Thereupon, q(k), s(k), and en(k) are shifted in time, becoming q(k − 1), s(k − 1), and en(k − 1), to match the calculation structure of the H matrix, as seen in Fig 4. Besides, the calculated signals Sc(i), MaxVal, AddrRAMij, and index are also propagated to the next MSM to preserve the score calculation structure. This process repeats until the last element of s is processed against the last element of q; an internal counter is used to determine that moment, since the sequence values are previously informed to all PEs. The forward stage finishes with the calculation of the last element of the matrix, i.e., H(M − 1, N − 1). Consequently, the Traceback signal is enabled, indicating the end of the process in all MSMs, and the AddrRAMij addresses corresponding to the maximum score value are sent to the next step (i.e., the backtracking process).
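The effect of this time shifting is that all cells on the same anti-diagonal of H are computed in the same cycle. A toy schedule (our own illustration, not part of the design files) makes the wavefront explicit:

```python
# In the systolic array, PE j processes s(i) at cycle t = i + j, so each
# cycle computes one full anti-diagonal of the H matrix in parallel.
def wavefront_schedule(M, N):
    for t in range(M + N - 1):
        yield t, [(i, t - i) for i in range(M) if 0 <= t - i < N]

for t, cells in wavefront_schedule(3, 3):
    print(f"cycle {t}: cells {cells}")  # e.g. cycle 2: (0,2), (1,1), (2,0)
```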

Algorithm 1 presents the SW pseudo-code for the forward stage and storage process structures. Algorithm 1 performs the calculation of the scores and the storage of the matrices H and D. The inputs are the signals q and s, defined in Eqs 1 and 2, respectively. The first loop in Algorithm 1 initializes the matrices H and D with zeros, as seen in Fig 2. The second loop represents the iterations driven by the en signal, which allow each element of s to be calculated in each PE. The first conditional structure is the multiplexer that makes the choices in the MSM, as seen in Fig 5. The Selection Maximum Value and Direction submodule, Fig 6, is represented by the second conditional structure, which compares the variables x, y, and v. The outputs are the D matrix stored in the MM and the position of the maximum value defined in the MSM.

3.2 Memory Module (MM)

The MM communicates with both the MSM and the BS, as shown in Fig 7. During the forward stage, the data regarding the distance values are written to the MM; during the BS, the memory addresses to align the sequences are fetched from the MM. The size of each memory is defined by the size of the s sequence; also, there is a flag to indicate that the memory is in write mode while computing the H matrix and, subsequently, in fetch mode during the backtracking process.

Fig 7. Representation of the simplified Memory Module structure.

This simplified model is practically the same as the complete PE that processes each column of the H matrix.

https://doi.org/10.1371/journal.pone.0254736.g007

The MM consists of Random Access Memories (RAMs) used to store the path directions, Direction, obtained in the MSM and needed later in the BS module. Hence, the RAMs are in write mode throughout the forward stage and in read mode during backtracking. The RAM input ports are the address bus, the data bus, and the write-enable signal. The size of each memory is defined based on the length of sequence s, and the number of RAMs is equal to the number of PEs in the systolic array.

The enable signal, en, is used as the write enable for each RAM in the MM. Therefore, en = 1 selects the write mode, while en = 0 selects the read mode. In addition, the btcontrol signal selects which module controls the RAM address bus: for btcontrol = 0, the memory addresses are defined by the MSM module through the wAddrDir signal, while btcontrol = 1 selects the BS module to define the addresses via the rAddrDir signal.

Thus, in write mode (en = 1 and btcontrol = 0), the wAddrDir signal defines the address of the RAMs where the Direction value is stored by the MSM. Subsequently, after the H matrix is fully calculated, the Traceback signal is enabled to indicate the end of the forward stage, and the MM goes into read mode (en = 0 and btcontrol = 1). Accordingly, the rAddrDir signal defines the address from which the BS fetches the data corresponding to the value reported by the trace-back.
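A behavioural sketch of one RAM in the MM (our software analogy, not HDL from the design) shows how en and btcontrol arbitrate the two modes:

```python
# One MM RAM: the MSM writes directions during the forward stage, and the
# BS reads them during backtracking; btcontrol selects who drives the address.
class DirectionRAM:
    def __init__(self, depth):
        self.mem = [0] * depth            # depth = length of the s sequence

    def cycle(self, en, btcontrol, wAddrDir=0, rAddrDir=0, direction=0):
        if en == 1 and btcontrol == 0:    # write mode: MSM drives wAddrDir
            self.mem[wAddrDir] = direction
            return None
        if en == 0 and btcontrol == 1:    # read mode: BS drives rAddrDir
            return self.mem[rAddrDir]     # returned to the BS as d(i)
        return None                       # bus idle otherwise
```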

3.3 Backtracking approach

The backtracking process starts when the Traceback signal is enabled in the MSM by the counters that determine the last PE and the last processed element of s, as described in the forward stage. As previously mentioned in subsection 3.1, the MSM propagates to the MM the maximum score address, which is used as the starting point for the alignment, as shown in Fig 8. Meanwhile, Fig 9 details the submodules used to create each BS module. The submodules in green are circuits for controlling and synchronizing all signals during the module operation, while the blue submodule performs the alignment path described in this section.

Fig 8. Backtracking Module structure in the FPGA.

The operation of this block starts after the forward step.

https://doi.org/10.1371/journal.pone.0254736.g008

Fig 9. Submodules that constitute the backtracking stage module.

The green submodules represent the control submodules, while the blue submodule represents the circuit that performs the alignment.

https://doi.org/10.1371/journal.pone.0254736.g009

Firstly, after Traceback is enabled, the BTStart signal is enabled, and the addresses of the maximum score element, AddrRAMi(N − 1) and AddrRAMj(N − 1), are sent to the respective BS. The values of AddrRAMi(N − 1) and AddrRAMj(N − 1) are assigned to mRAMi(N − 1) and mRAMj(N − 1), respectively, by the BT Enable submodule. It is important to emphasize that if the mRAMj(N − 1) value (i.e., AddrRAMj(N − 1)) does not correspond to the current BS PE, it is traced back through checks in the Memory Index submodule. This process happens until the PE corresponding to the maximum score location is reached. Afterward, the Memory Index submodule assigns the mRAMi value to rAddrDir to read the memories in the MM, which in turn returns the d(i) value to the Direction Process submodule, as can be seen in Fig 9.

Secondly, the alignment process starts. The circuits used to build the alignment submodule are shown in Fig 10. As can be observed, the input dj(i) is used as the multiplexer selector that implements Eq 10. For dj(i) = 3, the BS remains in the same memory position and moves back one BS module, i.e., a horizontal displacement. For dj(i) = 1, only the memory position decreases by 1, and the BS is verified by the Direction Process and Continue Processing submodules (i.e., a vertical displacement). Meanwhile, for dj(i) = 2, the memory position also decreases by 1 and the process moves to the previous module, i.e., a diagonal displacement. The circuit after the first multiplexer prevents negative memory addresses. These rules are summarized in the sketch after Fig 10.

Fig 10. Logical circuits used to build the Alignment Block submodule.

https://doi.org/10.1371/journal.pone.0254736.g010
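These displacement rules (our own summary of the behaviour above) map each direction code to a change in memory position and in BS module:

```python
# Alignment Block displacements per direction code:
# (delta memory position within the PE, delta BS module index).
MOVES = {
    2: (-1, -1),  # diagonal: previous memory position, previous BS module
    1: (-1,  0),  # vertical: previous memory position, same BS module
    3: ( 0, -1),  # horizontal: same memory position, previous BS module
}

def next_position(mem_pos, pe_index, d):
    dm, dp = MOVES[d]
    # The clamp mirrors the circuit that prevents negative memory addresses.
    return max(mem_pos + dm, 0), max(pe_index + dp, 0)
```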

Once the path to align the first element is found, the Alignment Block submodule receives the rAddrDirj and dj(i) signals to define the path to be followed by the next BS, as seen in Fig 9. Initially, a logical circuit enables the BT Start and Direction Process submodules to propagate those signals to the Alignment Block. The Direction Process and Continue Processing submodules carry out checks to define which BS module is active; that is, for dj(i) = 1, the data processing is held in the current BS module, and for dj(i) ≠ 1, the BTNext signal is enabled, indicating the end of data processing in the current PE so that it can start in the next one.

After finding the module holding the maximum score, the mRAMi and mRAMj signals complete their function. From then on, the path(j) signal is used as a guide for locating the alignment in each module. The data in the MM is then requested, and the dj value is returned for verifying and establishing the alignment. The verification and establishment of the alignment path are done by the Memory Index, Direction Process, and Continue Processing submodules. Decisions related to the dj value are made in the Alignment Block submodule, as illustrated in Fig 10.

Finally, the Finish Processing and Continue Processing submodules finish the data processing in the module. Thereby, the valDir output of each submodule is used to construct the alignment path, along with the maximum score position values. The trace-back continues until it reaches PE0 or finds a path direction with a value of 0.

Algorithm 2 presents the SW pseudo-code for the backtracking stage of this proposal. The backtracking stage, Algorithm 2, performs the alignment as a list using the path stored in D, starting from the positions of the maximum score, as seen in this section. The inputs for this step are provided by Algorithm 1. The loop of this step represents all backtracking stage modules, from N − 1 to 0. The conditional structure of Algorithm 2 represents the Alignment Block submodule, Fig 10, which performs the trace-back. The alignment path is returned by storing the valDir data in RAM.

Algorithm 2: SW backtracking stage pseudo-code based on the structure of this proposal

Input: query sequence q

Input: dataset sequence s

Input: distance path matrix D

Input: row position of maximum score posMi

Input: column position of maximum score posMj

Output: alignment sequences list A

Output: alignment path sequences list path

//Backtracking Stage;

auxi ← posMi; auxj ← posMj; aux ← D(auxi, auxj); A ← [];

path ← concat(path, aux);

while aux > 0 do

A ← concat(A, [q(auxj − 1); s(auxi − 1)]);

if aux = 2 then

  auxi = auxi − 1, auxj = auxj − 1;

else

  if aux = 3 then

   auxj = auxj − 1;

  else

   if aux = 1 then

    auxi = auxi − 1;

   else

    break;

   end

  end

end

aux ← D(auxi, auxj);

path ← concat(path, aux);

end

return A, path;

4 Results and discussion

This section presents the synthesis results for the architecture described in the previous section and analyzes it regarding the following key points: critical path, operating frequency, number of PEs, and performance. The performance measures the time to calculate an element of the scoring matrix.

The development of the algorithm was carried out using the development platform provided by the FPGA manufacturer, in this case, Xilinx [60]. This platform allows the user to develop circuits using a block diagram strategy instead of VHDL or Verilog. The architecture was deployed on the FPGA Virtex-6 XC6VLX240T and compared to state-of-the-art works. Usually, hardware implementations of the SW algorithm in the literature cover only the forward stage or both the forward and backtracking stages. In our proposal, both stages were implemented.

The performance of hardware implementations of the SW algorithm is usually measured in Giga Cell Updates Per Second (GCUPS), defined as

$$\text{GCUPS} = \frac{\text{number of cells}}{\text{execution time} \times 10^9}, \tag{11}$$

in which a cell is a matrix element to be computed. This metric can also be described based on the clock frequency, that is,

$$\text{GCUPS} = \frac{\text{number of PEs} \times f_{clk}}{10^9}. \tag{12}$$

The latter equation is often used to compare systolic array efficiency: since the number of cells updated per cycle is equivalent to the number of PEs and the clock frequency defines the operating rate, it is unnecessary to measure the total runtime of the algorithm.
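As a quick check of Eq 12 (the helper function is ours), the peak figure reported in this work follows directly from the PE count and clock frequency:

```python
# GCUPS = (number of PEs x clock frequency) / 1e9, as in Eq 12.
def gcups(num_pes, clock_hz):
    return num_pes * clock_hz / 1e9

print(gcups(512, 155e6))  # ~79.4, consistent with the reported 79.5 GCUPS
```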

4.1 Hardware architecture validation

To validate the architecture proposed in this work, the sequences q and s were randomly generated, varying the match, mismatch, and linear gap values. Initially, the analysis was carried out with 8 PEs, varying the size of the sequences q and s from 8 to 32. The number of PEs also varied according to the size of q. Our architecture works with sequences of varying lengths, only requiring that the length of s be greater than or equal to the length of q.

Firstly, the correctness of the matrices H and D was verified by monitoring the MSM outputs, such as Sc and Direction, as described in Section 2. Secondly, it was verified whether the D matrix elements were stored in the correct memory positions in the MM. Lastly, the operation of the BS modules was also verified by monitoring the path(j + 1) bus and the Memory Index submodule.

Next, the Alignment Block and Direction Process were observed to check whether the memory accesses were in accordance with the path(j + 1) value, that is, according to Eq 10. Also, the Finish Processing and Continue Processing submodules were monitored to verify the values propagated for a match (2), a horizontal gap (3), and a vertical gap (1).

The data bit-width was defined by the maximum size of the input sequences, limited by the FPGA memory capacity. Hence, the input sequence bit-width was set to 3, and the constants were defined according to this value. Besides, the bit-width of the MSM buses that perform mathematical operations was defined as ⌈log₂(totalPEs × α)⌉, while the counters for the s sequence use ⌈log₂(ssize)⌉ bits.
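Assuming these widths are ceilings of base-2 logarithms, they can be reproduced with a small helper (ours):

```python
import math

# Bus width for MSM arithmetic: enough bits for scores up to totalPEs x alpha.
def msm_bus_width(total_pes, alpha):
    return math.ceil(math.log2(total_pes * alpha))

# Counter width needed to index the s sequence.
def s_counter_width(s_len):
    return math.ceil(math.log2(s_len))

print(msm_bus_width(128, 5), s_counter_width(8192))  # demo parameters: 10, 13
```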

Fig 11 shows the architecture deployed and running on the Virtex-6 FPGA. The host computer (i7-3632QM CPU and 8 GB of RAM) was used to plot the results and compare them to the software implementation presented in [61], as shown in Fig 12. In Fig 12, the y-axis refers to the s sequence, while the x-axis refers to the q sequence. To increase the resolution of the image, only the parts of the sequences that are aligned are shown; the position at which the alignment starts and the maximum score value are given in the title of the illustration. The value of Row refers to the position in s, whereas Column refers to the element of q. The number of sequence alignments performed is represented by Number of Alignments.

Fig 11. Photo of the hardware architecture deployed on the Virtex-6 FPGA and the host computer used to plot the results.

https://doi.org/10.1371/journal.pone.0254736.g011

Fig 12. Illustration of the results obtained from our proposal in co-simulation.

The image is a more detailed view of the monitor in Fig 11. It can be seen that the y-axis refers to s, while the x-axis refers to q. The position at which the alignment starts is indicated by Row and Column. The maximum score value found is presented as Maximum Value. The number of sequence alignments performed is represented by Number of Alignments.

https://doi.org/10.1371/journal.pone.0254736.g012

The architecture parameters for the demo were set to match = 5, mismatch = −5, gap = 1, and 128 PEs. Hence, the size of the sequence q is also 128. Meanwhile, the size of the sequence s was set to 8,192, resulting in a total of 1,048,576 calculated cells. Sequence q is loaded into memory at each iteration, where it can vary among 4 different 128-nucleotide sequences in the demonstration. The demo is available at [62], and the implementation source code is available at [63].

The SW architecture was developed using the Xilinx System Generator on Matlab, and the data traffic between the host PC and the FPGA was carried over the Ethernet protocol. Moreover, we added a buffer on the FPGA to store data and developed a manager circuit to control the data flow. Therefore, the q and s sequences were transferred to the FPGA (via the Ethernet protocol), stored in the buffer, and subsequently fed into the SW architecture by the manager circuit to perform the alignment.

4.2 Synthesis analysis

The analysis of the synthesis results for the SW hardware implementation was carried out for two FPGAs: the Virtex-6 XC6VLX240T and the Virtex-7 XC7VX485T. Table 2 presents the hardware area occupation and frequency for different numbers of PEs. The size of the input sequences was defined according to the number of PEs.

Table 2. Area occupation results based on the FPGA synthesis of our SW implementation, with forward and backtracking stages.

https://doi.org/10.1371/journal.pone.0254736.t002

The critical path of the design was ≈8.34 ns and ≈6.44 ns for the Virtex-6 and Virtex-7, respectively. Therefore, the maximum clock frequency was 120 MHz for the Virtex-6 and 155 MHz for the Virtex-7. Regarding the FPGA area occupation, increasing the number of PEs also increases the hardware resources used. For 512 PEs on the Virtex-6, a total of 68% of the Slice Look-Up Tables (LUTs) were used, in contrast to only 7% for 64 PEs. Concerning the frequency, a slight decrease is observed as the number of PEs increases, due to the growth of the critical path. On the Virtex-7, there are still unused FPGA resources, as less than 35% of the Slice LUTs were used; these can be used to increase the number of PEs and, thus, the performance. Note that increasing the number of PEs and, consequently, the size of the sequence q increases the number of parallel computations, improving the performance. Therefore, according to the resources available in the target hardware, our architecture can operate with significantly more than 512 PEs.

4.3 Comparison with other works

Comparisons with state-of-the-art works were also performed. The performance of systolic-array-based implementations increases with the number of PEs; hence, the comparisons were carried out for the maximum number of PEs in each proposal. We compared our design to the most relevant and recent works similar to our proposal, i.e., works in which the SW algorithm is implemented using a systolic-array structure, the backtracking step is deployed, and the parameters concerning processing time and area occupation are provided. Given that, we discuss a direct comparison with the architectures presented in [23, 41, 47]. The remaining works shown in Table 3 illustrate general results of other FPGA implementations of the SW algorithm.

The works presented in Table 3 are sourced from [23]. The second column indicates whether the backtracking stage was also developed on the FPGA or only the forward stage. Meanwhile, the third to fifth columns present the number of PEs, the operating frequency, and the performance, respectively. The performance was obtained according to Eq 12. As can be seen, our approach and the one proposed by [23] were the only ones to implement a high number of PEs. However, in [23] only the backtracking path was deployed on the FPGA, and a submatrix structure is used to load the path chosen for alignment. Meanwhile, our architecture relies on a memory storage structure and the determination of the maximum score to align the sequences. The architecture proposed by [47] achieved the best performance, as can be seen in Table 3; however, it uses a register-file concept instead of a systolic array. Therefore, due to the similarity of the hardware techniques used to deploy the SW algorithm, we discuss a result comparison with [23], which achieved the second-best performance.

Furthermore, a comparison with [23] was also carried out regarding the FPGA area occupation, presented in Table 4. The second and third columns present the FPGA and the number of PEs used, respectively. Meanwhile, the fourth and fifth columns present the slices and memory blocks occupied, and the sixth column the operating frequency.

Table 4. Table with the summaries of the results of the FPGA synthesis works of SW implementation (hardware SW with backtracking step).

The Slice column is related to the logical distribution and refers to the occupied slices in the synthesis.

https://doi.org/10.1371/journal.pone.0254736.t004

As shown in Table 4, for the same number of PEs, our architecture occupied 35,286 slices and 0 BRAMs, in contrast to 57,870 slices and 896 block RAMs (28 Mbits of memory) in [23]. Also, their total area occupation was higher than 60%, compared to 46% for ours, due to the substitution matrix. Therefore, our proposal has high scalability due to its low resource usage (it can reach up to 1,024 PEs on the XC7VX485T). Besides, our proposal can be implemented on smaller FPGAs, such as the Virtex-6 XC6VLX240T, with a reasonable nucleotide sequence length.

Regarding the operating frequency, our proposal can reach up to 155 MHz. It is observed that the proposals with the best performances have a similar structure, even with different approaches to the solution; our proposal and [23] achieve the same performance at a frequency of 150 MHz. The proposal with the highest frequency was that of [47], reaching 500 MHz on a Virtex-5 XC5VLX50T FPGA.

Therefore, our work uses fewer hardware resources to perform the alignment process due to the chosen backtracking approach. As the backtracking stage results in high computational complexity, we simplified the process by mapping the path through the maximum value in D and H, resulting in linear computational complexity. On the other hand, the architecture proposed by [23] uses considerably more memory resources due to data partitioning and prefetching for the backtracking step. Although both works achieve similar performance due to the systolic array, there are significant differences in the alignment approach chosen for the FPGA implementation.

The SW design proposed by [41] is, to our knowledge, the most recent work on sequence alignment with the SW algorithm that also embeds the backtracking process in custom hardware. Their design achieved performance similar to ours for 512 PEs, as shown in Table 3. However, our approach can reach up to 1,024 PEs on the Virtex-7 XC7VX485T, at a lower clock frequency, and thus double the performance in GCUPS. Despite that, the design proposed by [23] achieved the second-best overall performance. Meanwhile, the architecture proposed in [47], using a register-file concept, achieved 128 GCUPS, the best overall performance shown in Table 3.

The hardware implementation of the alignment process through our approach, developed based on a chain of directions and the maximum score address, is a key contribution to the low memory usage. Thus, although we did not carry out tests with real biological datasets, it is theoretically possible to achieve high hardware scalability. Besides, sequences of any size can be aligned with our approach, limited only by the hardware resources available. In addition, the proposed method can compress the data, using only 3 bits in a fixed-point implementation.

5 Conclusion

This paper presented a parallel FPGA design to accelerate both the forward and backtracking stages of the SW algorithm. The main contributions were the high-speed data processing implementation and the low memory usage that theoretically allows high scalability. The hardware resources available on the FPGA are a limiting factor for the size of the score matrix, but not for the size of the sequences to be aligned; the design therefore satisfies the high-throughput, ultra-low-latency, and low-power requirements and helps alleviate the raw data processing problem in bioinformatics. By storing the alignment path distances and the maximum score position during the forward stage, it was possible to reduce the complexity of the backtracking stage, which can then follow the path directly. The proposed architecture achieved a satisfactory critical path, reduced memory usage and, theoretically, high scalability for the two-stage SW algorithm. Synthesis results showed that the proposed method can support up to 1,024 PEs in a single FPGA, the Xilinx Virtex-7 XC7VX485T. The main advantages are the low hardware resource usage and the high performance of 79.5 GCUPS, with an operating frequency of up to 155 MHz, without using external resources.

References

  1. Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, et al. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics. 2018;35(5):729–736.
  2. Pereira R, Oliveira J, Sousa M. Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics. Journal of Clinical Medicine. 2020;9(1). pmid:31947757
  3. Schuster S. Next-generation sequencing transforms today’s biology. Nature Methods. 2008;5:16–18. pmid:18165802
  4. Kumar G, Kocour M. Applications of next-generation sequencing in fisheries research: A review. Fisheries Research. 2017;186:11–22.
  5. Tanjo T, Kawai Y, Tokunaga K, Ogasawara O, Nagasaki M. Practical guide for managing large-scale human genome data in research. Journal of Human Genetics. 2020;66. pmid:33097812
  6. Zhou P, Shi ZL. SARS-CoV-2 spillover events. Science. 2021;371(6525):120–122. pmid:33414206
  7. Lyng GD, Sheils NE, Kennedy CJ, Griffin DO, Berke EM. Identifying optimal COVID-19 testing strategies for schools and businesses: Balancing testing frequency, individual test technology, and cost. PLOS ONE. 2021;16(3):1–13.
  8. Mazzarelli A, Giancola ML, Farina A, Marchioni L, Rueca M, Gruber CEM, et al. 16S rRNA gene sequencing of rectal swab in patients affected by COVID-19. PLOS ONE. 2021;16(2):1–15. pmid:33596245
  9. Miller D, Martin MA, Harel N, Kustin T, Tirosh O, Meir M, et al. Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel. Nature Communications. 2020;11:5518. pmid:33139704
  10. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403–410. pmid:2231712
  11. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 1970;48(3):443–453. pmid:5420325
  12. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology. 1981;147(1):195–197. pmid:7265238
  13. Afifi S, Gholamhosseini H, Sinha R. Hardware Implementations of SVM on FPGA: A State-of-the-Art Review of Current Practice. International Journal of Innovative Science, Engineering & Technology (IJISET). 2015;2:733–752.
  14. Aijaz A, Dohler M, Aghvami AH, Friderikos V, Frodigh M. Realizing the Tactile Internet: Haptic Communications over Next Generation 5G Cellular Networks. IEEE Wireless Communications. 2017;24(2):82–89.
  15. Houtgast EJ, Sima VM, Bertels K, Al-Ars Z. Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths. Computational Biology and Chemistry. 2018;75:54–64. pmid:29747076
  16. Courneya JP, Mayo A. High-performance computing service for bioinformatics and data science. Journal of the Medical Library Association: JMLA. 2018;106:494–495. pmid:30271293
  17. Arenas M, Mora A, Romero G, Castillo P. GPU Computation in Bioinformatics. A review. Advances in Intelligent Modelling and Simulation. 2012; p. 433–440.
  18. Khan D, Shedole S. Accelerated Deep Learning in Proteomics—A Review. Innovation in Electrical Power Engineering, Communication, and Computing Technology. 2020; p. 291–300.
  19. González-Domínguez J, Ramos S, Touriño J, Schmidt B. Parallel pairwise epistasis detection on heterogeneous computing architectures. IEEE Transactions on Parallel and Distributed Systems. 2015;27(8):2329–2340.
  20. Letras M, Bustio-Martínez L, Cumplido R, Hernández-León R, Feregrino-Uribe C. On the design of hardware architectures for parallel frequent itemsets mining. Expert Systems with Applications. 2020;157:113440.
  21. Juvonen MPT, Coutinho JGF, Wang JL, Lo BL, Luk W, Mencer O, et al. Custom hardware architectures for posture analysis. In: Proceedings. 2005 IEEE International Conference on Field-Programmable Technology; 2005. p. 77–84.
  22. Kaplan R, Yavits L, Ginosar R, Weiser U. A Resistive CAM Processing-in-Storage Architecture for DNA Sequence Alignment. IEEE Micro. 2017;37(4):20–28.
  23. Fei X, Dan Z, Lina L, Xin M, Chunlei Z. FPGASW: Accelerating Large-Scale Smith–Waterman Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array. Interdisciplinary Sciences: Computational Life Sciences. 2017;10. pmid:28432608
  24. Cadenelli N, Jaksic Z, Polo J, Carrera D. Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads. Future Generation Computer Systems. 2019;94:148–159.
  25. Franke K, Crowgey E. Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms. Genomics & Informatics. 2020;18:e10. pmid:32224843
  26. Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D. Graphics processing units in bioinformatics, computational biology and systems biology. Briefings in Bioinformatics. 2016;18(5):870–885.
  27. Manconi A, Moscatelli M, Gnocchi M, Armano G, Milanesi L. A GPU-based high performance computing infrastructure for specialized NGS analyses. In: PeerJ Preprints; 2016. p. 3.
  28. Kung HT. Why systolic architectures? Computer. 1982;15(1):37–46.
  29. Kung HT, McDanel B, Zhang SQ. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS’19. New York, NY, USA: Association for Computing Machinery; 2019. p. 821–834. Available from: https://doi.org/10.1145/3297858.3304028.
  30. Sze V. Designing Hardware for Machine Learning: The Important Role Played by Circuit Designers. IEEE Solid-State Circuits Magazine. 2017;9(4):46–54.
  31. Dias LA, Ferreira JC, Fernandes MAC. Parallel Implementation of K-Means Algorithm on FPGA. IEEE Access. 2020;8:41071–41084.
  32. Dias LA, Damasceno AM, Gaura E, Fernandes MA. A full-parallel implementation of Self-Organizing Maps on hardware. Neural Networks. 2021. pmid:34112575
  33. Barros WK, Dias LA, Fernandes MA. Fully Parallel Implementation of Otsu Automatic Image Thresholding Algorithm on FPGA. Sensors. 2021;21(12):4151. pmid:34204291
  34. Hughey R, Lopresti DP. Architecture of a programmable systolic array. In: Proceedings. International Conference on Systolic Arrays; 1988. p. 41–49.
  35. He D, He J, Liu J, Yang J, Yan Q, Yang Y. An FPGA-Based LSTM Acceleration Engine for Deep Learning Frameworks. Electronics. 2021;10(6).
  36. Zhang H, Fu Y, Feng L, Zhang Y, Hua R. Implementation of Hybrid Alignment Algorithm for Protein Database Search on the SW26010 Many-Core Processor. IEEE Access. 2019;7:128054–128063.
  37. Rognes T. Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation. BMC Bioinformatics. 2011;12:221. pmid:21631914
  38. Liu Y, Maskell D, Schmidt B. CUDASW++: Optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes. 2009;2:73. pmid:19416548
  39. Court T, Herbordt M. Families of FPGA-Based Accelerators for Approximate String Matching. Microprocessors and Microsystems. 2007;31:135–145. pmid:21603598
  40. Rucci E, Garcia C, Botella G, Giusti AED, Naiouf M, Prieto-Matias M. OSWALD: OpenCL Smith–Waterman on Altera’s FPGA for Large Protein Databases. The International Journal of High Performance Computing Applications. 2018;32(3):337–350.
  41. Wu JP, Lin YC, Wu YW, Hsieh SW, Tai CH, Lu YC. A Memory-Efficient Accelerator for DNA Sequence Alignment with Two-Piece Affine Gap Tracebacks. In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS); 2021. p. 1–4.
  42. Banerjee SS, El-Hadedy M, Lim JB, Kalbarczyk ZT, Chen D, Lumetta SS, et al. ASAP: Accelerated Short-Read Alignment on Programmable Hardware. IEEE Transactions on Computers. 2019;68(3):331–346.
  43. Saavedra A, Lehnert H, Hernández C, Carvajal G, Figueroa M. Mining Discriminative K-Mers in DNA Sequences Using Sketches and Hardware Acceleration. IEEE Access. 2020;8:114715–114732.
  44. Lloyd S, Snell QO. Sequence Alignment with Traceback on Reconfigurable Hardware. In: 2008 International Conference on Reconfigurable Computing and FPGAs; 2008. p. 259–264.
  45. Alser M, Hassan H, Kumar A, Mutlu O, Alkan C. Shouji: A fast and efficient pre-alignment filter for sequence alignment. Bioinformatics. 2019;35:4255–4263. pmid:30923804
  46. Rucci E, Garcia C, Botella G, De Giusti A, Naiouf M, Prieto Matias M. SWIFOLD: Smith-Waterman implementation on FPGA with OpenCL for long DNA sequences. BMC Systems Biology. 2018;12. pmid:30458766
  47. Sarkar A, Banerjee S, Ghosh S. An Energy-Efficient Pipelined-Multiprocessor Architecture for Biological Sequence Alignment. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2020;28(12):2598–2611.
  48. Nurdin D, Md Isa MN, Ismail RC, Ahmad MI. High Performance Systolic Array Core Architecture Design for DNA Sequencer. MATEC Web of Conferences. 2018;150:06009.
  49. Arram J, Luk W, Jiang P. Reconfigurable filtered acceleration of short read alignment. In: 2013 International Conference on Field-Programmable Technology (FPT); 2013. p. 438–441.
  50. Ng HC, Liu S, Coleman I, Chu RSW, Yue MC, Luk W. Acceleration of Short Read Alignment with Runtime Reconfiguration. In: 2020 International Conference on Field-Programmable Technology (ICFPT); 2020. p. 256–262.
  51. Seliem AG, Hamed HFA, Abouelwafa W. MapReduce Model Using FPGA Acceleration for Chromosome Y Sequence Mapping. IEEE Access. 2021;9:83402–83409.
  52. Koliogeorgi K, Voss N, Fytraki S, Xydis S, Gaydadjiev G, Soudris D. Dataflow Acceleration of Smith-Waterman with Traceback for High Throughput Next Generation Sequencing. In: 2019 29th International Conference on Field Programmable Logic and Applications (FPL); 2019. p. 74–80.
  53. Chen YL, Chang BY, Yang CH, Chiueh TD. A High-Throughput FPGA Accelerator for Short-Read Mapping of the Whole Human Genome. IEEE Transactions on Parallel and Distributed Systems. 2021;32(6):1465–1478.
  54. Siddiqui F, Amiri S, Minhas UI, Deng T, Woods R, Rafferty K, et al. FPGA-Based Processor Acceleration for Image Processing Applications. Journal of Imaging. 2019;5(1). pmid:34465705
  55. Pilz S, Porrmann F, Kaiser M, Hagemeyer J, Hogan JM, Rückert U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms. 2020;13(2).
  56. Rashed AEED, Obaya M, Moustafa HED. Accelerating DNA pairwise sequence alignment using FPGA and a customized convolutional neural network. Computers & Electrical Engineering. 2021;92:107112.
  57. Benkrid K, Liu Y, Benkrid A. A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2009;17(4):561–570.
  58. Isa MN, Benkrid K, Clayton T, Ling C, Erdogan AT. An FPGA-based parameterised and scalable optimal solutions for pairwise biological sequence analysis. In: 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS); 2011. p. 344–351.
  59. Sebastiao N, Roma N, Flores P. Integrated Hardware Architecture for Efficient Computation of the n-Best Bio-Sequence Local Alignments in Embedded Platforms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2012;20(7):1262–1275.
  60. Xilinx. System Generator for DSP; 2008. Accessed on Jan 30, 2020. Available from: https://www.xilinx.com/.
  61. Vasco P. Smith-Waterman-Algorithm; 2019. Accessed on June 04, 2021. Available from: https://github.com/pedrovasco96/Smith-Waterman-Algorithm/.
  62. Oliveira F, Fernandes M. Smith-Waterman-Algorithm Demo; 2021. Accessed on June 22, 2021. Available from: https://drive.google.com/drive/folders/1Mr78U1MNA6HvKV1fWA248Zp05LCGdJN0?usp=sharing.
  63. Oliveira F, Fernandes M. Smith-Waterman-Algorithm-on-FPGA; 2021. Accessed on December 02, 2021. Available from: https://github.com/Veritate/Smith-Waterman-Algorithm-on-FPGA.
  64. Oliver T, Schmidt B, Maskell D. Hyper customized processors for bio-sequence database scanning on FPGAs; 2005. p. 229–237.
  65. Zhang P, Tan G, Gao G. Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform; 2007. p. 39–48.
  66. Storaasli O, Yu W, Strenski D, Maltby J. Performance Evaluation of FPGA-Based Biological Applications. Seattle; 2007.
  67. Alachiotis N, Berger SA, Stamatakis A. Accelerating Phylogeny-Aware Short DNA Read Alignment with FPGAs. In: 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines; 2011. p. 226–233.
  68. Olson CB, Kim M, Clauson C, Kogon B, Ebeling C, Hauck S, et al. Hardware Acceleration of Short Read Mapping. In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines; 2012. p. 161–168.
  69. Preußer TB, Knodel O, Spallek RG. Short-Read Mapping by a Systolic Custom FPGA Computation. In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines; 2012. p. 169–176.
  70. Tang W, Wang W, Duan B, Zhang C, Tan G, Zhang P, et al. Accelerating Millions of Short Reads Mapping on a Heterogeneous Architecture with FPGA Accelerator. In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines; 2012. p. 184–187.
  71. Chen P, Wang C, Li X, Zhou X. Accelerating the Next Generation Long Read Mapping with the FPGA-Based System. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014;11(5):840–852. pmid:26356857