
An Integer Programming Formulation of the Minimum Common String Partition Problem

  • S. M. Ferdous,

    smferdous.cse@aust.edu

    Affiliations Department of Computer Science and Engineering, Ahsanullah University of Science and Technology (AUST), Dhaka, Bangladesh, AℓEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh

  • M. Sohel Rahman

    Affiliation AℓEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh

Abstract

We consider the problem of finding a minimum common string partition (MCSP) of two strings, which is an NP-hard problem. The MCSP problem is closely related to genome comparison and rearrangement, an important field in Computational Biology. In this paper, we map the MCSP problem into a graph applying a prior technique and using this graph, we develop an Integer Linear Programming (ILP) formulation for the problem. We implement the ILP formulation and compare the results with the state-of-the-art algorithms from the literature. The experimental results are found to be promising.

1 Introduction

In the minimum common string partition (MCSP) problem, we are given two related strings (S, T). Two strings are said to be related if the frequency of each letter in the two strings matches. A partition of a string S is defined as a sequence P = (b1, b2, …, bc), where the bi are substrings of S whose concatenation is equal to S, i.e., b1 b2 … bc = S. Given a partition P of a string S and a partition Q of a string T, we say that the pair π = < P, Q > is a common partition of (S, T) if Q is a permutation of P. The minimum common string partition problem is to find a common partition of (S, T) with the minimum number of substrings, that is, to minimize c. For example, if (S, T) = (atatgat, atgatat), then an optimal solution is π = < (at, atgat), (atgat, at) > and the minimum common partition size is 2. The restricted version of MCSP where each letter occurs at most d times in each input string is denoted by d-MCSP. A more detailed study of the applications of MCSP can be found in [1], [2] and [3].
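The defining conditions of a common partition can be checked mechanically. The following Python sketch (the function name is ours, for exposition only) verifies the example above:

```python
from collections import Counter

def is_common_partition(P, Q, S, T):
    """P and Q form a common partition of (S, T) iff P concatenates to S,
    Q concatenates to T, and Q is a permutation (multiset equal) of P."""
    return "".join(P) == S and "".join(Q) == T and Counter(P) == Counter(Q)

# the optimal solution of size c = 2 for (atatgat, atgatat) from the text
print(is_common_partition(["at", "atgat"], ["atgat", "at"],
                          "atatgat", "atgatat"))  # True
```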

In this paper, we present an Integer Linear Programming (ILP) formulation for the MCSP problem. In particular, we use a graph mapping that was presented in our prior work [4] to solve the MCSP problem using the Ant Colony Optimization technique [5]. Here we exploit this graph to devise an ILP formulation for the problem. Then we implement the ILP formulation, conduct extensive experiments and compare the results with the state-of-the-art algorithms from the literature. As will be reported in a later section, the results clearly indicate that the ILP formulation is effective and provides excellent results. One of the intriguing findings of our work is the fact that our ILP formulation turns out to be more effective and accurate than our meta-heuristics approach presented in [4]. This is especially interesting because both the algorithms are based on the same graph that is constructed through an interesting mapping [4].

The rest of the paper is organized as follows. In Section 2 we present a brief literature review. Section 3 presents the notations and definitions used in this paper. In Section 4 we present the ILP formulation for the MCSP problem. We present our experimental results in Sections 5 followed by a brief relevant discussion in Section 6. Finally, we briefly conclude in Section 7.

2 Related Works

The 1-MCSP problem is essentially the breakpoint distance problem [6] between two permutations, which is solvable in polynomial time [1]. The 2-MCSP problem has been shown to be NP-hard and, moreover, APX-hard in [1]. The authors of [1] have also presented several approximation algorithms for the problem. In [2], Chen et al. have studied a generalization of the MCSP problem called the Signed Reversal Distance with Duplicates (SRDD). Furthermore, they have presented a 1.5-approximation algorithm for the 2-MCSP problem. In [7], Damaschke has analyzed the fixed-parameter tractability of the MCSP problem considering different parameters. The MCSP problem is also studied in [8], where it is termed the true evolutionary distance problem between two genomes. In [9], the authors have investigated the d-MCSP problem along with two other variants, namely, MCSPc, where the alphabet size is at most c, and x-balanced MCSP, which requires that the lengths of the blocks be at most x away from the average length. They have shown that MCSPc is NP-hard when c ≥ 2. As for d-MCSP, they have presented a fixed-parameter tractable (FPT) algorithm which runs in O*((d!)^k) time, where k is the number of blocks in the optimal common partition. This result has been improved by Bulteau et al. [10] by showing that MCSP can be solved in O(d^(2k) · kn) time. Recently, Bulteau and Komusiewicz [11] have introduced the first fixed-parameter algorithm for the MCSP problem using the parameter k only.

Chrobak et al. [3] have analyzed a natural greedy heuristic for the MCSP problem: iteratively, at each step, it extracts a longest common substring from the input strings. They have shown that for the 2-MCSP problem, the approximation ratio of the greedy heuristic is exactly 3. They have also proved that for the 4-MCSP problem the ratio is log n, and that for the general case it lies between Ω(n^0.43) and O(n^0.67). In [12], He has proposed an improved greedy algorithm based on the greedy strategy of [3]; the idea is, whenever there is a symbol occurring only once, to extract at that step a longest common substring containing such a symbol.
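The greedy strategy of [3] can be sketched as follows (a minimal illustrative implementation of our own, not the exact code of [3] or [12]; it assumes the inputs are related strings):

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of a longest common
    substring of a and b (naive scan, kept simple for clarity)."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def greedy_mcsp(S, T):
    """Greedy heuristic: repeatedly extract a longest common substring of
    the still-unmatched parts of S and T until everything is matched."""
    s, t = list(S), list(T)          # matched positions become None

    def runs(chars):                 # maximal unmatched runs: (start, text)
        out, i = [], 0
        while i < len(chars):
            if chars[i] is None:
                i += 1
                continue
            j = i
            while j < len(chars) and chars[j] is not None:
                j += 1
            out.append((i, "".join(chars[i:j])))
            i = j
        return out

    blocks = []
    while any(c is not None for c in s):
        best = (0, 0, 0)             # (length, position in S, position in T)
        for si, srun in runs(s):
            for ti, trun in runs(t):
                k, a, b = longest_common_substring(srun, trun)
                if k > best[0]:
                    best = (k, si + a, ti + b)
        k, sp, tp = best
        blocks.append(S[sp:sp + k])
        s[sp:sp + k] = [None] * k    # mark the block as matched in both
        t[tp:tp + k] = [None] * k
    return blocks

print(greedy_mcsp("atatgat", "atgatat"))  # ['atgat', 'at'] -- size 2
```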

In our prior work [4], we have developed a meta-heuristic algorithm, namely, a MAX-MIN ant system, to solve the MCSP problem. In particular, in [4], we have mapped an instance of the MCSP problem into a graph, namely, the common substring graph. The MAX-MIN Ant System has been implemented over this graph. Recently, in [13], Blum et al. have proposed an iterative probabilistic tree search algorithm for solving this problem. The algorithm is an iterative probabilistic variant of the greedy algorithm of [3]. The authors have tested their approach on the dataset introduced in [4]. Subsequently, a common-block-based ILP formulation has been proposed in [14] by Blum et al. They have tested their ILP formulation on the previous benchmarks [4] as well as on a new benchmark of 7 larger instances.

3 Preliminaries

This section summarizes the definitions and notations used throughout the paper. Two strings (S, T), of equal length n, over an alphabet ∑ are called related if the frequencies of the letters in the two strings match. We define a block B = [S, i, j], 0 ≤ i ≤ j < n, of a string S as a data structure where i and j denote the starting and ending positions of the block. A block [S, i, j] represents a substring of S, denoted substring([S, i, j]), of length j − i + 1.

As an example, if we have two strings (S, T) = (atgcat,tgcata), then [S, 0, 1] and [S, 4, 5] both represent the substring at of S. In other words, substring([S, 0, 1]) = substring([S, 4, 5]) = at. We say that a block B matches with another block B′ if the two blocks represent the same substrings. Given a list of blocks lb, matchList(lb, B) is defined as a list of those blocks of lb that match B. For the example stated above, let a list of blocks be lb = {[S, 0, 1], [S, 1, 1], [S, 4, 5]} and B = [S, 0, 1]; then matchList(lb, B) = {[S, 0, 1], [S, 4, 5]}.
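These definitions transcribe directly into code; the following sketch (representing a block as a (string, i, j) triple; helper names are ours) reproduces the matchList example above:

```python
def substring(block):
    """substring([S, i, j]) = the characters of S from position i to j."""
    S, i, j = block
    return S[i:j + 1]

def match_list(lb, B):
    """The blocks of the list lb that represent the same substring as B."""
    return [b for b in lb if substring(b) == substring(B)]

S = "atgcat"
lb = [(S, 0, 1), (S, 1, 1), (S, 4, 5)]
# both [S, 0, 1] and [S, 4, 5] represent "at"; [S, 1, 1] represents "t"
print([(i, j) for _, i, j in match_list(lb, (S, 0, 1))])  # [(0, 1), (4, 5)]
```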

We use the notion of a common substring graph as introduced in [4]. A common substring graph, Gcs(V, E, S), of two strings (S, T) is defined as follows. Here V is the vertex set of the graph and E is the edge set. Vertices are the positions of string S, i.e., for each vV, v ∈ {0, 1, …, n − 1}. Two vertices vi and vj (vivj) are connected with an edge, i.e., (vi, vj) ∈ E, if the substring induced by the block [S, vi, vj] matches some substring of T. More formally, if ST denotes the set of all substrings of T, we have: (vi, vj) ∈ E if and only if substring([S, vi, vj]) ∈ ST.

In other words, each edge in the edge set corresponds to a block satisfying the above condition. For convenience, we will denote the edges as edge blocks and use the list of edge blocks (instead of edges) to define the edge set E.

For example, suppose (S, T) = (atgcta, atgcat). The corresponding common substring graph of the first string S, denoted by Gcs(V, E, S), has vertex set V = {0, 1, 2, 3, 4, 5} and edge set E = {[S, 0, 0], [S, 1, 1], [S, 2, 2], [S, 3, 3], [S, 4, 4], [S, 5, 5], [S, 0, 1], [S, 1, 2], [S, 2, 3], [S, 0, 2], [S, 1, 3], [S, 0, 3]}.
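The edge block set can be generated directly from the definition; this small sketch (the function name is ours) reproduces the 12 edge blocks listed above:

```python
def edge_blocks(S, T):
    """Edge blocks [S, i, j] of the common substring graph Gcs(V, E, S):
    all blocks of S whose substring also occurs somewhere in T."""
    return [(i, j) for i in range(len(S))
                   for j in range(i, len(S)) if S[i:j + 1] in T]

E = edge_blocks("atgcta", "atgcat")
print(len(E))  # 12, matching the edge set listed above
```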

4 ILP Formulation

Suppose we are given two related strings (S, T), each of length n. We create two graphs, namely, Gcs(V1, E1, S) and Gcs(V2, E2, T) of (S, T), where V1 and V2 are the vertex sets and E1 and E2 are the edge block sets of the two graphs respectively. We define two sets of binary variables, namely, xt for each tE1 and yt for each tE2, where xt = 1 (yt = 1) means that the edge block t is selected. We also write δk(v)− and δk(v)+ for the sets of incoming and outgoing edge blocks of a vertex vVk in Ek, where k ∈ {1, 2}: an incoming edge block of v is one whose ending position j equals v − 1, and an outgoing edge block of v is one whose starting position i equals v. With the above setting, we develop an ILP formulation (denoted as ILPgraph) for the MCSP problem using the common substring graph as follows:

minimize ∑t∈E1 xt (1)

subject to

t∈E1 xt = ∑t∈E2 yt (2)

t∈δ1(0)+ xt = 1 (3)

t∈δ1(v)− xt = ∑t∈δ1(v)+ xt for all vV1 ∖ {0} (4)

t∈δ2(0)+ yt = 1 (5)

t∈δ2(v)− yt = ∑t∈δ2(v)+ yt for all vV2 ∖ {0} (6)

t∈matchList(E1, B) xt = ∑t∈matchList(E2, B) yt for each edge block BE1E2 (7)

xt ∈ {0, 1} for all tE1 and yt ∈ {0, 1} for all tE2 (8)

4.1 Explanation of the Formulation

Objective function.

Eq 1 is the objective function that is to be minimized. The function simply calculates the size of the partition.

Equality constraint.

Eq 2 states that two partitions on the two substring graphs must be of equal size. In other words, the number of blocks in the factorization of the first string S must be equal to the number of blocks in the factorization of the second string T.

Factorization constraint.

Eqs 3 and 4 together ensure that a unit flow enters at the source (the vertex labelled 0) and arrives at the sink (the vertex labelled n − 1) for string S. So the string is factorized. For string T, the factorization is achieved in a similar fashion by Eqs 5 and 6. These constraints ensure that the strings are factorized into non-overlapping blocks.

One to one match constraint.

We have two sets of blocks after the factorization. We must ensure that there is a one-to-one matching between the two sets of blocks. By matching we mean that for each selected block (with xt = 1, where tE1) of the first edge block set E1, there must be one and only one corresponding selected block (with yt = 1, where tE2) with the same substring in the second edge block set E2, and vice versa. Eq 7 achieves the one-to-one matching by ensuring that, for each edge block, the number of matching selected blocks in E1 equals the number of matching selected blocks in E2.

Integrality constraint.

Eq 8 ensures the integrality of the variables.

This is a polynomial formulation. The number of variables as well as the number of constraints depends on the sizes of the edge block sets E1 and E2. In the worst case, the number of variables and constraints can be O(n^2), where n is the size of the vertex set. In practice, however, the number of variables is much smaller than this worst case, as is evident from the experimental results reported in the following section.
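To make the semantics of the formulation concrete: selecting edge blocks subject to Eqs 2–8 amounts to choosing one factorization of each string (a source-to-sink path in each common substring graph) whose block multisets coincide. The following brute-force sketch (our own illustrative code, not part of the formulation, and only feasible for tiny strings; the ILP performs this selection implicitly via branch-and-bound) computes the same optimum:

```python
from collections import Counter

def factorizations(S, T):
    """All factorizations of S into blocks whose substrings occur in T,
    i.e. all source-to-sink paths in the common substring graph of S."""
    n = len(S)
    def go(i):
        if i == n:
            yield []
            return
        for j in range(i, n):
            piece = S[i:j + 1]
            if piece in T:               # the block must be an edge block
                for rest in go(j + 1):
                    yield [piece] + rest
    return go(0)

def mcsp_bruteforce(S, T):
    """Exact minimum common partition size for tiny related strings."""
    t_factorizations = list(factorizations(T, S))
    best = None
    for P in factorizations(S, T):
        if best is not None and len(P) >= best:
            continue
        # one-to-one match: some factorization of T must be a permutation of P
        if any(Counter(P) == Counter(Q) for Q in t_factorizations):
            best = len(P)
    return best

print(mcsp_bruteforce("atatgat", "atgatat"))  # 2
```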

5 Experiments

Except for one, we have conducted all our experiments on a computer with an Intel(R) Core(TM) i5-2450M CPU @ 2.50 GHz and 4.00 GB of installed memory (RAM). One particular experiment has been run on another machine with the same configuration except for a larger RAM of 8.00 GB. The operating system was Windows 8.1. The programming environment was Matlab. We have used the SCIP (version 3.1.0) standalone solver [15] to solve ILPgraph.

5.1 Data sets

We have conducted our experiments on 5 sets of random synthetic data (henceforth labelled as Group1-Group5) and a real gene sequence dataset (henceforth labelled as Real). The datasets are briefly described below.

Group1-Group3.

In our previous work [4], we generated uniform random DNA sequences, each of length at most 600, using “FaBox (1.41)” [16]. A pair of DNA sequences (S, T) was generated by randomly shuffling [16] one DNA sequence from the set using the “Sequence Manipulation Suite” [17]. This dataset is divided into 3 groups. The first 10 instances (Group1) have lengths of at most 200 bps (base pairs), the next 10 (Group2) have lengths within [201, 400] bps and the remaining 10 (Group3) have lengths within [401, 600] bps. Notably, these datasets are also used for experimentation and analysis by researchers in recent papers [13, 14].

Group4.

We have also tested our formulation with a new random dataset collected through personal communication with Christian Blum, one of the co-authors of [14]. This new dataset is a collection of 300 uniform random instances of different lengths and alphabet sizes. The sequences in the dataset are of lengths {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000} and of alphabet size {4, 12, 20}. In particular, for each length there are 30 sequences among which the first 10 are of alphabet size 4, the next 10 are of alphabet size 12 and the rest are of alphabet size 20.

Group5.

This dataset was introduced in [14] to test the solving limit of their ILP formulation. It consists of 7 instances of lengths {800, 1000, 1200, 1400, 1600, 1800, 2000}.

Real.

We have used the real gene sequence data used in [4]. These data correspond to the first 15 gene sequences of Bacterial Sequencing (part 14), whose lengths are within [200, 600].

5.2 Implementation

The SCIP [15] (version 3.1.0) standalone solver is used to solve the ILP formulation. SCIP runs on a single thread [18]. Solving an instance is a two-step procedure. First, for each instance we have to generate the variables and constraints in a format that is understandable to SCIP. Using Matlab, we have generated the MPS (Mathematical Programming System) files of the instances. These files are the input to the solver. For the solver, we have enforced a time limit of 3600 CPU seconds for Group1-Group3, Group4 and Real. The first 5 of the 7 instances of Group5 have been allowed 3600 seconds each, whereas the other 2 have been given 7200 seconds each. All other parameters have been left at their default values.

5.3 Results and Analysis

In an updated and extended version [19] (the preprint is available at [20]) of our earlier work [4], MAX-MIN ACO (referred to as MMAS henceforth) has been compared with the greedy algorithm of [3]. In [13], the authors have compared their two versions of iterative probabilistic tree search (TS1 and TS2) with Greedy and MMAS. Here we report only the best of the two tree search solutions (henceforth referred to as TS). Recently in [14], the authors have compared the results of their ILP formulation (ILPorig) with Greedy, MMAS and TS. Here, we compare our ILP formulation, i.e., ILPgraph with MMAS [4, 19, 20], TS [13] and ILPorig[14]. As for the greedy algorithm, we have considered the improved greedy approach in [12] (henceforth labelled as Greedy).

Table 1 presents the comparison among the results of ILPgraph and the other competitive approaches for the Group1-Group3 and Real datasets. For each group, the first column is the instance number. The second, third and fourth columns report the common partition size found by Greedy [12], MMAS [19] and TS [13] respectively. The fifth to eighth columns summarize the results of ILPorig, obtained from [14]. The fifth column is the partition size. The sixth column is the time in seconds, presented in X/Y format only when the solver has been unable to find the optimal solution within 3600 CPU seconds; otherwise it is shown as a single value reporting the time to reach the optimal solution. The seventh column reports the relative gap, where the gap is defined as the difference between the value of the best valid solution (primal bound) and the lower bound (dual bound) of the problem. The relative gap is formulated as ∣(upperboundlowerbound)/min(∣upperbound∣, ∣lowerbound∣)∣. The eighth column is the number of variables in the formulation for the instance. The last four columns report the results of our formulation, ILPgraph; they report the same information as the fifth to eighth columns. The best result for an instance is boldfaced.
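Written out as code, the relative gap computation is simply (a trivial helper of our own, with illustrative bound values):

```python
def relative_gap(upper, lower):
    """Relative gap between the primal (upper) and dual (lower) bounds,
    as defined above; both bounds are assumed to be nonzero."""
    return abs((upper - lower) / min(abs(upper), abs(lower)))

print(relative_gap(42, 40))  # 0.05
```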

From Table 1, it is easily verified that ILPgraph provides a much better common partition size than the other approaches. Out of 45 instances, it provides an equal or better partition size than ILPorig in 42 cases, among which 23 are strictly better. The improvement is not only in the solution size but also in the computational time. Except for Group1, ILPgraph has been able to achieve improved solutions in significantly less time than ILPorig. The number of variables is also dramatically reduced in ILPgraph. Fig 1 shows the percentage of improvement of ILPgraph over the other approaches considered; the significant improvement is evident from the figure.

Fig 1. Percentage of improvement of ILPgraph over Greedy, MMAS, TS and ILPorig.

Top: Improvement in average solution. Bottom: Improvement in median solutions.

https://doi.org/10.1371/journal.pone.0130266.g001

Table 2 reports the average results on the Group4 dataset. Here, the average of the results of the 10 instances for each length group having a particular alphabet size is reported. For example, the first row reports the average results of the ten 100-length instances over an alphabet of size 4. The results of ILPorig were collected through personal communication with the authors of [14]. It is notable that for the Group4 dataset, ILPorig was implemented using GCC 4.7.3 and IBM ILOG CPLEX V12.1. Moreover, as reported in [14], the corresponding experiments were conducted on a cluster of PCs with 2933 MHz Intel(R) Xeon(R) 5670 CPUs having 12 cores and 32 GB of RAM. The third to seventh columns report the solution of ILPorig while the eighth to thirteenth columns report the solution of ILPgraph. The columns report the same information as in Table 1, with the following exceptions. Firstly, the time when the first valid solution is achieved and the time when the best solution is achieved within the time limit (3600 sec) are presented in two different columns (labelled ftime and time respectively). Secondly, for each formulation, the number of instances among the 10 (represented by each row) that have been solved optimally is reported in the column named #opt. Finally, the last two columns report the percentage of improvement in average partition size and the percentage of decrease in the number of variables of ILPgraph over ILPorig respectively.

Table 2. Comparison of average results on Group4 dataset.

https://doi.org/10.1371/journal.pone.0130266.t002

Like the Group1-Group3 and Real datasets, the results in Table 2 support the same conclusion: the ILPgraph formulation provides better solutions than ILPorig in almost every aspect. Numerically, ILPgraph obtains an equal or better average partition in 28 out of 30 cases, of which 12 are strictly better. The number of instances solved optimally by ILPgraph is 172 (out of 300), which is 12 more than that of ILPorig. The percentage of improvement in the average solutions also demonstrates the superiority of ILPgraph. As is evident from Table 2, the improvement becomes more pronounced as the string length increases and the alphabet size decreases. This observation is also supported by Table 3, which reports the solutions of the two formulations for the Group5 dataset. The 7 instances of Group5 were introduced in [14] to test the limit of their formulation. Their simulation [14] was conducted on a cluster of PCs with “Intel(R) Xeon(R) CPU 513” CPUs having 4 cores at 2000 MHz and 4 GB of RAM, with a time limit of 12 hours. On the other hand, for this dataset, we have enforced 3600 seconds for the first 5 instances and 7200 seconds for the last two. ILPorig could not achieve a valid solution for the last instance even within 12 hours, whereas ILPgraph got a valid solution in 6100 seconds. From the percentage of improvement (%impr) it can be concluded that ILPgraph achieves a better partition size in less time as the length of the string increases. The number of variables also becomes intractable for ILPorig as the length increases. All of these results speak in favor of ILPgraph.

Table 3. Comparison result for Group5 dataset.

NSF means “No solutions found”.

https://doi.org/10.1371/journal.pone.0130266.t003

Finally, to further test the limit of our formulation, i.e., ILPgraph, we have conducted an experiment with an instance of length 3000 on the machine with 8GB of RAM. The time limit was set to 12 hours. ILPgraph has been able to get a valid solution of partition size 642 in 11 hours.

5.4 Running Time

In the previous section, we have shown that ILPgraph provides a much better partition size. In this section we explore the running time of ILPgraph. It is clear from Tables 1–3 that ILPgraph reaches its solutions faster in most cases, even running on a slower processor with less memory. This is also true for the first valid solution it provides. Fig 2 shows the comparison of the average first valid solution time for the three groups of the Group4 dataset, grouped by alphabet size. From this figure it is clear that ILPgraph finds the first valid solution faster than ILPorig, and the difference in running time becomes more apparent as the length of the string increases. Now we concentrate on comparing the running time of ILPgraph with the other four approaches. The running times of the two tree search algorithms (referred to as TS1 and TS2) are taken from [13]. The running time of MMAS is taken from [19]. The Greedy algorithm is very fast; it produces its output within a few seconds. So, in the analysis, we assume that the output of the Greedy algorithm is readily available even at the beginning of the simulation. We have recorded the primal solution (partition size) of our algorithm periodically. Figs 3–6 show the detailed runtime comparison among the algorithms for the Group1, Group2, Group3 and Real datasets respectively. For each group we have shown the average partition size dynamics with respect to time. The three points (“*”, “+”, “o”) in each of the figures plot the average partition size vs. the average time needed to achieve that partition size for the MMAS, TS1 and TS2 approaches respectively (data taken from [13], [19]). The dashed line represents the Greedy partition size.

Fig 2. Average time for the first valid solution found by ILPgraph on Group4 data.

Top: Alphabet size 4. Middle: Alphabet size 12. Bottom: Alphabet size 20.

https://doi.org/10.1371/journal.pone.0130266.g002

Although the reported time of ILPgraph (in Table 1) is higher than that of the Greedy, TS1 and TS2 approaches for some instances, it can easily be observed from Figs 3–6 that ILPgraph reaches better solutions much earlier. From the figures it is clear that ILPgraph is better than Greedy at any point in time. Even if we stop the algorithm at or earlier than the average runtime of MMAS, TS1 or TS2, ILPgraph provides better solutions.

6 Discussion

At this point, a brief discussion on the number of variables in the two ILP formulations, namely ILPorig and ILPgraph, is in order. In Fig 7, we show the comparison of the number of variables between the two formulations for the Group4 dataset. Although both formulations have O(n^2) variables, we observe a significant decrease in the number of variables in ILPgraph compared to ILPorig. The average improvement in the number of variables is reported in the last column of Tables 2 and 3 for the Group4 and Group5 datasets respectively. For the Group4 dataset, the maximum and minimum percentages of decrease in the number of variables are 97.06% and 57.55%, with an average improvement of 88.24%, while for the Group5 dataset the maximum and minimum are 98.22% and 96.48%, with an average of 97.52%.

Fig 7. Comparison of average number of variables between ILPorig and ILPgraph for Group4 dataset.

Top: Alphabet size 4. Middle: Alphabet size 12. Bottom: Alphabet size 20.

https://doi.org/10.1371/journal.pone.0130266.g007

The drastic improvement in the number of variables for ILPgraph, and the lack thereof for ILPorig, can easily be understood by analyzing the variable sets of the two formulations. ILPorig is based on common blocks. A common block b of two strings (S, T) is defined in [14] as a triple (t, k1, k2), where t is a common substring of (S, T) that occurs at position k1 of S and position k2 of T, with 0 ≤ k1, k2n − 1. B = {B1, B2, …, Bm} is the (ordered) set of all common blocks of (S, T). This set is the variable set of ILPorig. For example, if (S, T) = (aaagggccc, gggaaaccc), then the number of common blocks is 42. To see this, first concentrate on one common substring of S and T, namely aaa. The common blocks resulting from this common substring are B = {[a, 0, 3], [a, 0, 4], [a, 0, 5], [a, 1, 3], [a, 1, 4], [a, 1, 5], [a, 2, 3], [a, 2, 4], [a, 2, 5], [aa, 0, 3], [aa, 0, 4], [aa, 1, 3], [aa, 1, 4], [aaa, 0, 3]}. Similar common blocks can be computed for the other two common substrings (ggg and ccc) too. On the other hand, the number of variables in ILPgraph depends on the number of edges in the common substring graph. Thus, for the above example, if we construct the common substring graph on S, we have 18 edge blocks, E = {[S, 0, 0], [S, 0, 1], [S, 0, 2], [S, 1, 1], [S, 1, 2], [S, 2, 2], [S, 3, 3], [S, 3, 4], [S, 3, 5], [S, 4, 4], [S, 4, 5], [S, 5, 5], [S, 6, 6], [S, 6, 7], [S, 6, 8], [S, 7, 7], [S, 7, 8], [S, 8, 8]}. Thus ILPgraph reduces the number of variables significantly.
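The two variable counts in this example can be reproduced mechanically (the enumeration code below is ours, written directly from the two definitions):

```python
def common_blocks(S, T):
    """Variables of ILPorig: triples (t, k1, k2) where the common
    substring t occurs at position k1 of S and position k2 of T."""
    blocks = []
    for i in range(len(S)):
        for j in range(i, len(S)):
            t = S[i:j + 1]
            for k2 in range(len(T) - len(t) + 1):
                if T[k2:k2 + len(t)] == t:
                    blocks.append((t, i, k2))
    return blocks

def edge_blocks(S, T):
    """Variables of ILPgraph: edge blocks [S, i, j] of the common
    substring graph on S."""
    return [(i, j) for i in range(len(S))
                   for j in range(i, len(S)) if S[i:j + 1] in T]

S, T = "aaagggccc", "gggaaaccc"
print(len(common_blocks(S, T)), len(edge_blocks(S, T)))  # 42 18
```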

7 Conclusion and Future works

In this paper, we have presented an ILP formulation for the MCSP problem. We have conducted extensive experiments and compared the results with the state-of-the-art algorithms in the literature. The results clearly indicate that the ILP formulation is effective and provides excellent results. The observations of Section 5.4 suggest important research directions. Research in this field should now focus on finding minimum common partitions for larger instances in reasonable time. As ILPgraph provides better solutions faster than the other competitive approaches, one idea is to stop the solver as soon as it finds the first solution. This solution, or possibly a set thereof, can be used as the initial solution(s) for existing and new meta-heuristic approaches developed to solve this problem, including the ones reported in [4] and [13]. Another research direction could be as follows. So far, MCSP has been studied mostly in the context of operations research. However, it has important applications in genome comparison and rearrangement. So, datasets from comparative genomics applications could be gathered for further experimental analysis and comparison with relevant algorithms (e.g., [10]) in the field of computational biology.

Supporting Information

S1 Dataset. Group1 dataset.

The text file contains the 10 paired instances of the Group1 dataset.

https://doi.org/10.1371/journal.pone.0130266.s001

(TXT)

S2 Dataset. Group2 dataset.

The text file contains the 10 paired instances of the Group2 dataset.

https://doi.org/10.1371/journal.pone.0130266.s002

(TXT)

S3 Dataset. Group3 dataset.

The text file contains the 10 paired instances of the Group3 dataset.

https://doi.org/10.1371/journal.pone.0130266.s003

(TXT)

S4 Dataset. Group4 dataset.

The compressed folder contains 300 files, each containing one instance; the lengths range from 100 to 1000 and the files are organized by alphabet size.

https://doi.org/10.1371/journal.pone.0130266.s004

(TGZ)

S5 Dataset. Group5 dataset.

The compressed folder contains 7 files, each containing one instance, with lengths from 800 to 2000.

https://doi.org/10.1371/journal.pone.0130266.s005

(TGZ)

S6 Dataset. Real dataset.

The text file contains the 10 paired instances of the Real dataset.

https://doi.org/10.1371/journal.pone.0130266.s006

(TXT)

S7 Dataset. 3000 length instance.

The text file contains an instance of length 3000.

https://doi.org/10.1371/journal.pone.0130266.s007

(TXT)

Acknowledgments

One of the authors, M. Sohel Rahman is currently on a sabbatical leave from Bangladesh University of Engineering and Technology (BUET).

Author Contributions

Conceived and designed the experiments: SMF MSR. Performed the experiments: SMF. Analyzed the data: SMF MSR. Contributed reagents/materials/analysis tools: SMF. Wrote the paper: SMF MSR.

References

  1. Goldstein A, Kolman P, Zheng J. Minimum Common String Partition Problem: Hardness and Approximations. Electr J Comb. 2005;12:R50. Available from: http://www.combinatorics.org/Volume_12/Abstracts/v12i1r50.html.
  2. Chen X, Zheng J, Fu Z, Nan P, Zhong Y, Lonardi S, et al. Assignment of Orthologous Genes via Genome Rearrangement. IEEE/ACM Trans Comput Biol Bioinformatics. 2005 Oct;2(4):302–315.
  3. Chrobak M, Kolman P, Sgall J. The Greedy Algorithm for the Minimum Common String Partition Problem. ACM Trans Algorithms. 2005 Oct;1(2):350–366.
  4. Ferdous SM, Rahman MS. Solving the Minimum Common String Partition Problem with the Help of Ants. In: Tan Y, Shi Y, Mo H, editors. Advances in Swarm Intelligence. vol. 7928 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2013. p. 306–313. Available from: http://dx.doi.org/10.1007/978-3-642-38703-6_36.
  5. Dorigo M, Di Caro G, Gambardella LM. Ant Algorithms for Discrete Optimization. Artif Life. 1999 Apr;5(2):137–172. pmid:10633574.
  6. Watterson GA, Ewens WJ, Hall TE, Morgan A. The chromosome inversion problem. Journal of Theoretical Biology. 1982;99(1):1–7. Available from: http://www.sciencedirect.com/science/article/pii/0022519382903848.
  7. Damaschke P. Minimum Common String Partition Parameterized. In: Crandall K, Lagergren J, editors. Algorithms in Bioinformatics. vol. 5251 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2008. p. 87–98.
  8. Swenson KM, Marron M, Earnest-Deyoung JV, Moret BME. Approximating the True Evolutionary Distance Between Two Genomes. J Exp Algorithmics. 2008 Aug;12:3.5:1–3.5:17.
  9. Jiang H, Zhu B, Zhu D, Zhu H. Minimum Common String Partition Revisited. In: Lee DT, Chen D, Ying S, editors. Frontiers in Algorithmics. vol. 6213 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2010. p. 45–52. Available from: http://dx.doi.org/10.1007/978-3-642-14553-7_7.
  10. Bulteau L, Fertin G, Komusiewicz C, Rusu I. A Fixed-Parameter Algorithm for Minimum Common String Partition with Few Duplications. In: Darling A, Stoye J, editors. Algorithms in Bioinformatics. vol. 8126 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2013. p. 244–258. Available from: http://dx.doi.org/10.1007/978-3-642-40453-5_19.
  11. Bulteau L, Komusiewicz C. Minimum Common String Partition Parameterized by Partition Size Is Fixed-Parameter Tractable. p. 102–121. Available from: http://epubs.siam.org/doi/abs/10.1137/1.9781611973402.8.
  12. He D. A Novel Greedy Algorithm for the Minimum Common String Partition Problem. In: Mǎandoiu I, Zelikovsky A, editors. Bioinformatics Research and Applications. vol. 4463 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2007. p. 441–452. Available from: http://dx.doi.org/10.1007/978-3-540-72031-7_40.
  13. Blum C, Lozano J, Pinacho Davidson P. Iterative Probabilistic Tree Search for the Minimum Common String Partition Problem. In: Blesa M, Blum C, Voß S, editors. Hybrid Metaheuristics. vol. 8457 of Lecture Notes in Computer Science. Springer International Publishing; 2014. p. 145–154. Available from: http://dx.doi.org/10.1007/978-3-319-07644-7_11.
  14. Blum C, Lozano JA, Davidson P. Mathematical programming strategies for solving the minimum common string partition problem. European Journal of Operational Research. 2015;242(3):769–777. Available from: http://www.sciencedirect.com/science/article/pii/S0377221714008716.
  15. Achterberg T. SCIP: Solving constraint integer programs. Mathematical Programming Computation. 2009;1(1):1–41. Available from: http://mpc.zib.de/index.php/MPC/article/view/4.
  16. Villesen P. FaBox: An online fasta sequence toolbox; 2007. Available from: http://www.birc.au.dk/software/fabox.
  17. Stothard P. The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. BioTechniques. 2000 Jun;28(6). Available from: http://view.ncbi.nlm.nih.gov/pubmed/10868275.
  18. Miltenberger M. Personal communication.
  19. Ferdous SM, Rahman MS. A MAX-MIN Ant Colony System for Minimum Common String Partition Problem; 2014. Manuscript submitted for publication.
  20. Ferdous SM, Rahman MS. A MAX-MIN Ant Colony System for Minimum Common String Partition Problem. CoRR. 2014;abs/1401.4539.