
Automatic Clustering Using Multi-objective Particle Swarm and Simulated Annealing

  • Ahmad Abubaker ,

    ahm.abubaker@gmail.com

Affiliations School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Penang, Malaysia, Department of Mathematics & Statistics, Al-Imam Muhammad Ibn Saud Islamic University, P.O. Box 90950, 11623 Riyadh, Saudi Arabia

  • Adam Baharum,

Affiliation School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Penang, Malaysia

  • Mahmoud Alrefaei

Affiliation Department of Mathematics & Statistics, Jordan University of Science and Technology, Irbid 22110, Jordan

Abstract

This paper puts forward a new automatic clustering algorithm based on multi-objective particle swarm optimization and simulated annealing, “MOPSOSA”. The proposed algorithm performs automatic clustering, that is, it partitions a dataset into a suitable number of clusters without this number being given in advance. MOPSOSA combines the features of multi-objective particle swarm optimization (PSO) and multi-objective simulated annealing (MOSA). Three cluster validity indices are optimized simultaneously to establish the suitable number of clusters and the appropriate clustering for a dataset. The first cluster validity index is based on Euclidean distance, the second on the point symmetry distance, and the last on short distance. A number of algorithms are compared with the MOPSOSA algorithm in solving clustering problems by determining the actual number of clusters and the optimal clustering. Computational experiments were carried out on fourteen artificial and five real-life datasets.

Introduction

Data clustering is an important task in unsupervised learning. The clustering technique distributes the dataset into clusters of similar features [1]. To solve a clustering problem, the number of clusters that fits a dataset must be determined, and the objects must be assigned to these clusters appropriately. The number of clusters may or may not be known, which makes it difficult to find the best solution to the clustering problem. As such, the clustering problem can be viewed as an optimization problem. This challenge has led to the proposal of many automatic clustering algorithms in the literature; these algorithms estimate the appropriate number of clusters and partition a dataset into these clusters without the need to know the actual number of clusters [2–8]. Most of these algorithms rely exclusively on one internal evaluation function (validity index). The validity index is an objective function that evaluates various characteristics of the clusters and thus reflects the clustering quality and accuracy of the clustering solutions [9]. Nevertheless, a single evaluation function is often unable to determine the appropriate clusters for a dataset, thus giving an inferior solution [10]. Accordingly, the clustering problem is structured here as a multi-objective optimization problem in which different validity indices can be applied and evaluated simultaneously.

Several automatic multi-objective clustering algorithms have been proposed in the literature to solve the clustering problem. Evolution appeared in this area after Handl and Knowles [3] proposed an evolutionary approach called multi-objective clustering with automatic K determination (MOCK). For some automatic multi-objective clustering algorithms related to MOCK, the reader may refer to [11–13]. A multi-objective clustering technique inspired by MOCK, named VAMOSA, which is based on simulated annealing as the underlying optimization strategy and on the point symmetry-based distance, was proposed by Saha and Bandyopadhyay [5].

Dealing with various shapes of datasets (hyperspherical, linear, spiral, convex, and non-convex), overlapping datasets, datasets with a small or large number of clusters, and datasets whose objects have few or many dimensions, without being given the proper clustering or the number of clusters, is a challenge. Saha and Bandyopadhyay [8] developed two multi-objective clustering techniques (GenClustMOO and GenClustPESA2) by using a simulated annealing-based multi-objective optimization technique and the concept of multiple centers for each cluster, which can deal with different types of cluster structures. GenClustMOO and GenClustPESA2 were compared with MOCK [3], VGAPS [4], K-means (KM) [14], and the single-linkage clustering technique (SL) [15] using numerous artificial and real-life datasets of diverse complexities. However, these algorithms did not give the desired high accuracy in clustering the datasets.

The current study proposes an automatic clustering algorithm, namely, hybrid multi-objective particle swarm optimization with simulated annealing (MOPSOSA), which deals with different sizes, shapes, and dimensions of datasets and an unknown number of clusters. The numerical results of the proposed algorithm are shown to be better than those of the GenClustMOO [8] and GenClustPESA2 [8] methods in terms of clustering accuracy (see the Results and Discussions section). To deal with any dataset, determine appropriate clusters, and obtain good solutions with high accuracy, combinatorial particle swarm optimization II [7] is developed to handle three different cluster validity indices simultaneously. The first cluster validity index is the Davies-Bouldin index (DB-index) [16], which is based on Euclidean distance; the second is the symmetry-based cluster validity index (Sym-index) [4], which is based on the point symmetry distance; and the last is a connectivity-based cluster validity index (Conn-index) [17], which is based on short distance. If a particle position does not change or if the particle moves to a bad position, then the MOPSOSA algorithm uses MOSA [18] to improve the searching particle. The MOPSOSA algorithm also utilizes the KM method [14] to improve the selection of the initial particle positions because of their significance in the overall performance of the search process. The algorithm creates a large number of Pareto optimal solutions through a trade-off between the three different validity indices. Therefore, the idea of fitness sharing [19] is incorporated in the proposed algorithm to maintain diversity in the repository that contains the Pareto optimal solutions. Pareto optimal solutions are important for decision makers to choose from. Furthermore, to comply with the decision-maker requirements, the proposed algorithm utilizes a semi-supervised method [20] to provide a single best solution from the Pareto set. The performance of MOPSOSA is compared with the performances of three automatic multi-objective clustering techniques, namely, GenClustMOO [8], GenClustPESA2 [8], and MOCK [3], and with those of three single-objective clustering techniques, namely, VGAPS [4], KM [14], and SL [15], using 14 artificial and 5 real-life datasets.

The remainder of this paper is structured as follows. Section 2 describes the multi-objective clustering problem; Section 3 illustrates the proposed MOPSOSA algorithm in detail; Section 4 presents the datasets used in the numerical experiments, the evaluation of clustering quality, and the parameter settings for the MOPSOSA algorithm; Section 5 discusses the results; finally, concluding remarks are given in Section 6.

Clustering Problem

The clustering problem is defined as follows: Consider the dataset P = {p1,p2,…,pn}, where pi = (pi1,pi2,…,pid) is a feature vector of d dimensions, also referred to as an object, pij is the feature value of object i at dimension j, and n is the number of objects in P. The clustering of P is the partitioning of P into k clusters {C1,C2,…,Ck} with the following properties:

Ci ≠ ∅, i = 1,…,k, (1)

Ci ∩ Cj = ∅, i,j = 1,…,k, i ≠ j, (2)

C1 ∪ C2 ∪ … ∪ Ck = P. (3)

The clustering optimization problem with one objective function can be formulated as minC∈Θ f(C), such that Eqs (1) to (3) are satisfied, where f is the validity index function, Θ is the feasible solution set that contains all possible clusterings of the dataset P of n objects into k clusters, C = {C1,C2,…,Ck}, and k = 2,3,…,n−1.

The multi-objective clustering problem for S different validity indices is defined as

minC∈Θ F(C) = [f1(C), f2(C),…,fS(C)], (4)

where F(C) is a vector of S validity indices. Note that there may be no single solution that minimizes all the functions fi(C) simultaneously. Therefore, the aim is to identify the set of all non-dominated solutions.

Definition: Consider C and C* as two solutions in the feasible solutions set Θ, the solution C is said to be dominated by the solution C* if and only if fi(C*) ≤ fi(C), ∀ i = 1,…,S and fi(C*) < fi(C) for at least one i. Otherwise, C is said to be non-dominated by C*.

The Pareto optimal set is a set that includes all non-dominated solutions in the feasible solutions set Θ.
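
Because the dominance relation above is used throughout the algorithm (to update personal bests, accept new positions, and maintain the repository), a minimal Python sketch is given below; the function names are illustrative rather than taken from the paper, and all validity indices are assumed to be minimized.

from typing import Sequence

def dominates(f_star: Sequence[float], f: Sequence[float]) -> bool:
    """Return True if the objective vector f_star dominates f, i.e. f_star is
    no worse in every validity index and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f_star, f))
    strictly_better = any(a < b for a, b in zip(f_star, f))
    return no_worse and strictly_better

def pareto_front(objectives: list) -> list:
    """Indices of the non-dominated solutions among a list of objective vectors."""
    return [i for i, fi in enumerate(objectives)
            if not any(dominates(fj, fi) for j, fj in enumerate(objectives) if j != i)]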

The Proposed MOPSOSA Algorithm

Simulated annealing requires more computation time than particle swarm optimization [21], because it requires slow variation of the temperature parameter to reach a global solution [22]. In particle swarm optimization, some particles may become stagnant and remain unchanged, especially when the objective functions of the best personal position and the best global position are similar [21]. Such a particle cannot jump out, which causes convergence toward a local solution and the loss of the ability to search for the optimal Pareto set. This is a disadvantage in comparison with simulated annealing, which can jump away from a local solution. The proposed MOPSOSA algorithm, as previously mentioned, is a hybrid algorithm that merges the fast computation and convergence of particle swarm optimization with the ability of simulated annealing to evade local solutions.

The clustering solution Xi is described using label-based integer encoding [23]. Each particle position is a clustering solution. The particle position and velocity at time t are presented as vectors with n components, Xi^t = (xi1^t,…,xin^t) and Vi^t = (vi1^t,…,vin^t), i = 1,…,m, where n is the number of data objects and m is the number of particles (swarm size). The position component xij^t represents the cluster number of the jth object in the ith particle, and vij^t represents the motion of the jth object in the ith particle, where xij^t ∈ {1,…,Ki^t} and Ki^t ∈ [Kmin, Kmax] is the number of clusters related to particle i at time t (Kmin and Kmax are the minimum and maximum numbers of clusters, respectively; the default value of Kmin is 2, and Kmax takes its default value [24] unless it is manually specified). The best previous position of the ith particle at iteration t is represented as XPi^t, and the leader position chosen from the repository of Pareto sets for the ith particle at iteration t is represented by XGi^t.

The flowchart in Fig 1 illustrates the general process of the MOPSOSA algorithm. The process of the algorithm is described in the following 11 steps:

  1. Step 1: The algorithm parameters, such as swarm size m, number of iterations Iter, maximum and minimum numbers of clusters, velocity parameters, initial cooling temperature T0, and t = 0, are initialized.
  2. Step 2: The initial particle positions Xi^0 using the KM method [14], the initial velocities Vi^0 = 0, and the initial personal best positions XPi^0 = Xi^0, i = 1,…,m, are generated.
  3. Step 3: The objective functions f1(Xi^0),…,fS(Xi^0), i = 1,…,m, where S is the number of objective functions, are computed. The repository of Pareto sets is filled with all non-dominated Xi^0, i = 1,…,m, on a fitness-sharing basis.
  4. Step 4: The leader XGi^t from the repository of Pareto sets nearest to the current Xi^t is selected. The clusters in XPi^t and XGi^t are renumbered on the basis of their similarity to the clusters in Xi^t, i = 1,…,m.
  5. Step 5: The new Vnewi and Xnewi, i = 1,…,m, are computed using Xi^t, Vi^t, XPi^t, and XGi^t.
  6. Step 6: The validity of Xnewi, i = 1,…,m, is checked, and the correction process is applied if it is not valid.
  7. Step 7: The objective functions f1(Xnewi),…,fS(Xnewi), i = 1,…,m, are computed.
  8. Step 8: A dominance check for Xnewi, i = 1,…,m, is performed; that is, if Xnewi is non-dominated by Xi^t, then Xi^(t+1) = Xnewi and Vi^(t+1) = Vnewi; otherwise, the MOSA technique is applied and Xi^(t+1) and Vi^(t+1) are set to the position and velocity obtained by the MOSA technique. The MOSA technique is discussed in detail in the MOSA Technique section below. Upon completion of the generation of new positions for all particles, the cooling temperature Tt+1 is updated.
  9. Step 9: The new XPi^(t+1), i = 1,…,m, is identified.
  10. Step 10: The Pareto set repository is updated.
  11. Step 11: t = t + 1 is set; if t ≥ Iter, then the algorithm is stopped and the Pareto set repository contains the Pareto solutions; otherwise, go to Step 4.

The following sections will elucidate the steps of the MOPSOSA algorithm.

Particles swarm initialization

Initial particles are generally considered one of the success factors in particle swarm optimization that affect the quality of the solution and the speed of convergence. Hence, the MOPSOSA algorithm employs the KM method to improve the generation of the initial swarm of particles. Fig 2 depicts a flowchart for the generation of m particles. Starting with i = 1 and W = min{Kmax − Kmin + 1, m}, if W = m, then m particles are generated by the KM method with the number of clusters Ki = Kmin + i − 1, i = 1,…,m. If W = Kmax − Kmin + 1, then the first W particles are generated by KM with the number of clusters Ki = Kmin + i − 1, i = 1,…,W, and the remaining particles are generated by KM with the number of clusters Ki, i = W+1,…,m, selected randomly between Kmin and Kmax. For each particle, the initial velocity is set to zero, Vi = 0, i = 1,…,m, and the initial XPi is equal to the current position Xi for all i = 1,…,m.
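
A minimal sketch of this initialization scheme using scikit-learn's KMeans is shown below; the function name, the seed handling, and the uniform choice of cluster counts for the extra particles are illustrative assumptions rather than code from the paper.

import numpy as np
from sklearn.cluster import KMeans

def initialize_swarm(data: np.ndarray, m: int, k_min: int, k_max: int, seed: int = 0):
    """Generate m label-encoded particle positions with KM, as described above.

    The first min(k_max - k_min + 1, m) particles use K = k_min, k_min + 1, ...;
    any remaining particles use a K drawn uniformly from [k_min, k_max]."""
    rng = np.random.default_rng(seed)
    w = min(k_max - k_min + 1, m)
    positions = []
    for i in range(m):
        k = k_min + i if i < w else rng.integers(k_min, k_max + 1)
        labels = KMeans(n_clusters=int(k), n_init=10, random_state=seed).fit_predict(data)
        positions.append(labels + 1)            # cluster numbers start at 1
    velocities = [np.zeros(len(data), dtype=int) for _ in range(m)]
    personal_best = [p.copy() for p in positions]
    return positions, velocities, personal_best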

Objective functions

The proposed algorithm uses three types of cluster validity indices as objective functions to achieve optimization. These validity indices, DB-index, Sym-index, and Conn-index, apply three different distances, namely, Euclidean distance, point symmetric distance, and short distance, respectively. Each validity index indicates a different aspect of good solutions in clustering problems. These validity indices are described below.

DB-index.

This index was developed by Davies and Bouldin [16]; it is a function of the ratio of within-cluster scatter (intra-cluster distance) to between-cluster separation (inter-cluster distance). The scatter Si,q within the ith cluster Ci is calculated using Eq (5), and the distance dij,t between clusters Ci and Cj is computed using Eq (6):

Si,q = ((1/ni) Σp∈Ci ‖p − ci‖^q)^(1/q), (5)

dij,t = (Σs=1..d |cis − cjs|^t)^(1/t), (6)

where ni = |Ci| is the number of objects in cluster Ci, ci = (1/ni) Σp∈Ci p is the center of cluster Ci, and q and t are positive integers. DB is defined as

DB = (1/k) Σi=1..k Ri,qt, (7)

where Ri,qt = maxj≠i {(Si,q + Sj,q)/dij,t}. A small value of DB means a good clustering result.
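
The following sketch computes the DB-index from a label-encoded clustering, following the reconstruction of Eqs (5) to (7) above; the default choice q = t = 2 is an assumption made here for illustration.

import numpy as np

def db_index(data: np.ndarray, labels: np.ndarray, q: int = 2, t: int = 2) -> float:
    """Davies-Bouldin index for a label-encoded clustering (smaller is better).
    S_{i,q} is the within-cluster scatter, d_{ij,t} the Minkowski distance
    between cluster centres (Eqs (5)-(7) as reconstructed above)."""
    clusters = np.unique(labels)
    centres = np.array([data[labels == c].mean(axis=0) for c in clusters])
    # Eq (5): within-cluster scatter S_{i,q}
    s = np.array([
        (np.mean(np.linalg.norm(data[labels == c] - centres[i], axis=1) ** q)) ** (1.0 / q)
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    r_max = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            d_ij = np.sum(np.abs(centres[i] - centres[j]) ** t) ** (1.0 / t)   # Eq (6)
            r_max[i] = max(r_max[i], (s[i] + s[j]) / d_ij)
    return float(r_max.mean())                                                  # Eq (7)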

Sym-index.

The recently developed point symmetry distance dps(p,c) is employed in this cluster validity index Sym, which measures the overall average symmetry with respect to the cluster centers [4]. It is defined as follows. Let p be a point; the reflected point of p with respect to a specific center c is p* = 2c − p. Let the knear unique nearest neighbors of p* be at Euclidean distances di, i = 1,…,knear. The point symmetry distance is defined as

dps(p,c) = dsym(p,c) × de(p,c), (8)

where de(p,c) is the Euclidean distance between the point p and the center c, and dsym(p,c) = (Σi=1..knear di)/knear is a symmetry measure of p with respect to c. In this study, knear = 2. The cluster validity function is defined as

Sym(k) = (1/k) × (1/Ek) × Dk, (9)

where Ek = Σi=1..k Ei, Ei = Σj=1..ni d*ps(pj^i, ci), pj^i is the jth object of cluster i, and Dk is the maximum Euclidean distance between two centers among all cluster pairs. Eq (8) is used with a constraint to compute d*ps(pj^i, ci): the knear nearest neighbors of (pj^i)* and of pj^i should belong to the ith cluster, where (pj^i)* is the reflected point of pj^i with respect to ci. A large value of the Sym-index means that the actual number of clusters and proper partitioning are obtained.
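
A hedged sketch of the point symmetry distance and the Sym-index, following the reconstruction of Eqs (8) and (9) with knear = 2, is given below; restricting the nearest-neighbor search to the points of the cluster under consideration is how the constraint on d*ps is realized in this illustration.

import numpy as np
from scipy.spatial.distance import cdist

def d_ps(p: np.ndarray, centre: np.ndarray, cluster_points: np.ndarray, knear: int = 2) -> float:
    """Point symmetry distance of Eq (8), restricted to the points of one cluster:
    the reflected point p* = 2c - p, d_sym is the mean distance of p* to its knear
    nearest neighbours, and the result is d_sym multiplied by d_e(p, c)."""
    p_star = 2.0 * centre - p
    dists = np.sort(np.linalg.norm(cluster_points - p_star, axis=1))[:knear]
    d_sym = dists.mean()
    return d_sym * np.linalg.norm(p - centre)

def sym_index(data: np.ndarray, labels: np.ndarray, knear: int = 2) -> float:
    """Sym-index of Eq (9) (larger is better), computed as D_k / (k * E_k)."""
    clusters = np.unique(labels)
    centres = np.array([data[labels == c].mean(axis=0) for c in clusters])
    e_k = sum(
        d_ps(p, centres[i], data[labels == c], knear)
        for i, c in enumerate(clusters)
        for p in data[labels == c]
    )
    d_k = cdist(centres, centres).max()          # maximum distance between centres
    return d_k / (len(clusters) * e_k)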

Conn-index.

The third cluster validity index used in this study was proposed by Saha and Bandyopadhyay [17]; it depends on the notion of cluster connectedness. To compute the Conn-index, the relative neighborhood graph (RNG) [25] of the dataset has to be constructed first. Subsequently, the short distance between two points x and y, denoted by dshort(x,y), is defined as

dshort(x,y) = min over the npath paths of Σj=1..nedi w(ej^i), (10)

where npath is the number of all paths between x and y in the RNG, nedi is the number of edges along the ith path, i = 1,…,npath, ej^i is the jth edge in the ith path, j = 1,…,nedi, and w(ej^i) is the weight of the edge ej^i. The edge weight is equal to the Euclidean distance de(a,b), where a and b are the end points of the edge ej^i.
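
The sketch below builds the relative neighborhood graph and computes dshort between all pairs of points with SciPy, under the assumption (as reconstructed above) that Eq (10) amounts to a shortest-path distance over the RNG; it is an illustration, not the authors' implementation.

import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def rng_short_distances(data: np.ndarray) -> np.ndarray:
    """All-pairs short distances d_short over the relative neighborhood graph (RNG).
    Edge (a, b) belongs to the RNG if no third point c is closer to both a and b
    than they are to each other; edge weights are Euclidean distances."""
    d = squareform(pdist(data))
    n = len(data)
    adj = np.zeros_like(d)
    for a in range(n):
        for b in range(a + 1, n):
            # RNG condition: the "lune" of a and b contains no other point
            if not any(max(d[a, c], d[b, c]) < d[a, b] for c in range(n) if c not in (a, b)):
                adj[a, b] = adj[b, a] = d[a, b]
    # Assumed reading of Eq (10): d_short is the shortest-path length on the RNG
    return shortest_path(csr_matrix(adj), method="D", directed=False)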

The cluster validity index Conn developed by Saha and Bandyopadhyay [17] is given by Eq (11). It measures, through the short distance dshort, how tightly the points of each cluster are connected to the cluster medoid mi relative to the separation between clusters, where mi is the medoid of the ith cluster, that is, the point of Ci with the minimum average distance to all points in Ci. A small value of the Conn-index means that the clusters are internally interconnected and well separated from each other.

After the particles have been moved to a new position, the three objective functions are computed for each particle in the swarm. The objective functions for a particle position X are {DB(X),1/Sym(X),Conn(X)}. The three objectives are minimized simultaneously using MOPSOSA algorithm.

XP updating

The previous best position XPi^t of the ith particle at iteration t is updated using the non-dominance criterion: XPi^t is compared with the new position Xi^(t+1). Three cases of this comparison are considered.

  • If XPi^t is dominated by Xi^(t+1), then XPi^(t+1) = Xi^(t+1).
  • If Xi^(t+1) is dominated by XPi^t, then XPi^(t+1) = XPi^t.
  • If Xi^(t+1) and XPi^t do not dominate each other, then one of them is chosen randomly as XPi^(t+1).

This update occurs on each particle.

Repository updating

The repository is utilized by the MOPSOSA algorithm as a guide for the swarm toward the Pareto front. The non-dominated particle positions are stored in the repository. To preserve the diversity of non-dominated solutions in the repository, fitness sharing [19] is a good method to control the acceptance of new entries into the repository when it is full. Fitness sharing was used by Lechuga and Rowe [26] in multi-objective particle swarm optimization. In each iteration, the new non-dominated solutions are added to the external repository and the dominated solutions are eliminated. If the number of non-dominated solutions exceeds the size of the repository, the fitness sharing is calculated for all non-dominated solutions, and the solutions that have the largest values of fitness sharing are selected to fill the repository.
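
A sketch of repository truncation by fitness sharing is given below; the standard sharing function of [19] applied in a normalized objective space is assumed, and the sharing radius sigma_share is an illustrative parameter, not a value from the paper.

import numpy as np

def truncate_repository(objectives: np.ndarray, capacity: int, sigma_share: float = 0.1,
                        alpha: float = 1.0) -> np.ndarray:
    """Keep the `capacity` non-dominated solutions with the largest shared fitness.

    A standard sharing function is assumed: sh(d) = 1 - (d / sigma_share)^alpha for
    d < sigma_share and 0 otherwise, computed on normalized objective vectors; the
    shared fitness of a solution is the inverse of its niche count, so solutions in
    sparse regions of the Pareto front are preferred."""
    objs = np.asarray(objectives, dtype=float)
    span = objs.max(axis=0) - objs.min(axis=0)
    span[span == 0] = 1.0
    norm = (objs - objs.min(axis=0)) / span
    d = np.linalg.norm(norm[:, None, :] - norm[None, :, :], axis=-1)
    sh = np.where(d < sigma_share, 1.0 - (d / sigma_share) ** alpha, 0.0)
    niche_count = sh.sum(axis=1)                       # includes the self-contribution of 1
    shared_fitness = 1.0 / niche_count
    return np.argsort(-shared_fitness)[:capacity]      # indices of the solutions to keep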

Cluster re-numbering

The re-numbering process is designed to eliminate redundant particles that represent the same solution. The proposed MOPSOSA algorithm employs the re-numbering procedure designed by Masoud et al. [7]. This procedure uses a similarity function to measure the degree of similarity between the clusters of the two input solutions Xi^t and XPi^t (or XGi^t). The two clusters that are most similar are matched. Any cluster in XPi^t (or XGi^t) not matched to any cluster uses an unused number in the cluster numbering. The MOPSOSA algorithm uses the similarity function known as the Jaccard coefficient [27], which is defined as

Sim(Cj, C*k) = n11 / (n11 + n10 + n01), (12)

where Cj is the jth cluster in Xi^t, C*k is the kth cluster in XPi^t (or XGi^t), n11 is the number of objects that exist in both Cj and C*k, n10 is the number of objects that exist in Cj but not in C*k, and n01 is the number of objects that do not exist in Cj but exist in C*k.
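
The following sketch relabels one solution against a reference solution using the Jaccard coefficient of Eq (12); the greedy most-similar-pair matching order is an assumption about the procedure of [7], chosen here for simplicity.

import numpy as np

def renumber(reference: np.ndarray, other: np.ndarray) -> np.ndarray:
    """Relabel the clusters of `other` so that each takes the number of the most
    similar cluster (Jaccard coefficient, Eq (12)) in `reference`; clusters left
    unmatched receive the smallest unused numbers."""
    ref_ids, oth_ids = np.unique(reference), np.unique(other)
    relabeled = np.zeros_like(other)
    used, remaining = set(), list(oth_ids)
    # greedily match the most similar pair of clusters first
    pairs = sorted(
        ((np.sum((reference == r) & (other == o)) /
          np.sum((reference == r) | (other == o)), r, o)
         for r in ref_ids for o in oth_ids),
        reverse=True)
    for sim, r, o in pairs:
        if r not in used and o in remaining and sim > 0:
            relabeled[other == o] = r
            used.add(r)
            remaining.remove(o)
    unused = (i for i in range(1, len(ref_ids) + len(oth_ids) + 2) if i not in used)
    for o in remaining:                      # unmatched clusters get unused numbers
        relabeled[other == o] = next(unused)
    return relabeled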

Velocity computation

The MOPSOSA algorithm employs the expressions and operators modified by Masoud et al. [7]. The new velocity for particle i at iteration t is calculated as

Vi^(t+1) = (W ⊗ Vi^t) ⊕ (R1 ⊗ (XPi^t ⊖ Xi^t)) ⊕ (R2 ⊗ (XGi^t ⊖ Xi^t)), (13)

where W, R1, and R2 are vectors of n components with values 0 or 1 that are generated randomly with probabilities w, r1, and r2, respectively. The operators ⊗, ⊕, and ⊖ denote multiplication, merging, and difference, respectively, and are defined below.

  • Difference operator⊖

The difference operation calculates the difference between XPi^t (or XGi^t) and the current position Xi^t. The two resulting vectors, D1 = XPi^t ⊖ Xi^t and D2 = XGi^t ⊖ Xi^t, are defined component-wise by Eqs (14) and (15).

  • Multiplication operator ⊗

The multiplication operator is defined as follows: let A = (a1,…,an) and B = (b1,…,bn) be two vectors of n components; then A ⊗ B = (a1b1,…,anbn).

  • Merging operator ⊕

The merging operator is defined as follows: let A = (a1,…,an) and B = (b1,…,bn) be two vectors of n components; then C = A ⊕ B = (c1,c2,…,cn), where each component cj is determined according to Eq (16).
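
Since the component-wise rules of Eqs (14) to (16) are specified in [7], the sketch below shows only one plausible realization of the three operators and of the velocity update of Eq (13): the difference keeps the label of the better position where it disagrees with the current one, and merging prefers the later operand where it is non-zero. These choices are assumptions for illustration, not necessarily the exact operators of [7].

import numpy as np

def difference(best: np.ndarray, position: np.ndarray) -> np.ndarray:
    """Assumed ⊖: keep the label of `best` where it disagrees with `position`, else 0."""
    return np.where(best != position, best, 0)

def multiply(mask: np.ndarray, velocity: np.ndarray) -> np.ndarray:
    """⊗: component-wise product with a random 0/1 vector (zeroes out components)."""
    return mask * velocity

def merge(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumed ⊕: take the component of `b` when it is non-zero, otherwise that of `a`."""
    return np.where(b != 0, b, a)

def new_velocity(v, x, xp, xg, w=0.7, r1=0.5, r2=0.5, rng=np.random.default_rng(0)):
    """Eq (13): V_new = W⊗V ⊕ R1⊗(XP ⊖ X) ⊕ R2⊗(XG ⊖ X)."""
    n = len(x)
    W, R1, R2 = (rng.random(n) < p for p in (w, r1, r2))
    return merge(merge(multiply(W, v), multiply(R1, difference(xp, x))),
                 multiply(R2, difference(xg, x)))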

Position computation

The MOPSOSA algorithm employs the position-update definition proposed by Masoud et al. [7]. The new position Xi^(t+1) is generated from the velocity Vi^(t+1) according to Eq (17), where r is an integer random number whose range can exceed the current number of clusters; this property enables the particle to add new clusters. The operators above and the differences in the cluster numbers of Xi^t, XPi^t, and XGi^t lead to the addition or removal of some clusters in the new position Xi^(t+1). Sometimes an empty cluster may result, which leads to an invalid particle position. Such an instance is corrected by renumbering the clusters of the particle: the re-numbering process reassigns the largest cluster number to the smallest unused one.

MOSA technique

The MOSA method [18] is applied in the MOPSOSA algorithm at iteration t for particle i in case Xi^t dominates the new position Xnewi. Fig 3 presents the flowchart for the MOSA technique applied in MOPSOSA. The procedure for the MOSA technique is explained in the eight steps below.

Fig 3. Flowchart for the MOSA technique applied in MOPSOSA.

https://doi.org/10.1371/journal.pone.0130995.g003

  1. Step 1: Let PSX and PSV be two empty sets, niter be the maximum number of iterations, and q = 0.
  2. Step 2: Evaluate the acceptance value EXPq based on the cooling temperature Tt, which is updated in Step 8 of the MOPSOSA algorithm. Generate a uniform random number u ∈ (0,1); if u < EXPq, go to Step 7. Otherwise, proceed to the next step.
  3. Step 3: Add Xnewi to PSX and Vnewi to PSV; then PSX and PSV are updated to include only non-dominated solutions.
  4. Step 4: If q ≥ niter, then choose a solution randomly from PSX as the new particle position Xnewi and the corresponding velocity Vnewi from PSV, and proceed to Step 7. Otherwise, set q = q + 1, and generate the new velocity Vnewi and position Xnewi from the old position Xi^t.
  5. Step 5: Calculate the objective functions f1(Xnewi),…,fS(Xnewi).
  6. Step 6: Perform a dominance check for Xnewi; if Xnewi is non-dominated by Xi^t, then proceed to Step 7. Otherwise, go to Step 2.
  7. Step 7: The new position Xnewi and velocity Vnewi are accepted as the new generation Xi^(t+1) and Vi^(t+1), respectively.
  8. Step 8: Check the validity of Xi^(t+1), and apply the re-numbering process if it is invalid. Return Xi^(t+1) and Vi^(t+1).

Selection of the best solution

In general, a Pareto set containing several non-dominated solutions is provided at the final run of a multi-objective problem [28]. Each non-dominated solution introduces a pattern of clustering for the given dataset. The semi-supervised method proposed by Saha and Bandyopadhyay [20] is utilized in the MOPSOSA algorithm to select the best solution from the Pareto optimal set. This semi-supervised approach can only be applied when the cluster labels of some points in the dataset are known. The misclassification value is computed using the Minkowski score MS [29]. Let T be the actual solution and C be the selected solution; then MS is defined as

MS(T,C) = sqrt((n01 + n10) / (n11 + n10)), (18)

where n11 is the number of pairs of objects that are in the same cluster in both T and C, n10 is the number of pairs that are in the same cluster in T but not in C, and n01 is the number of pairs that are in the same cluster in C but not in T.

Low values of MS are better, with the optimal value of MS being 0.
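
A sketch of the Minkowski score under the pair-counting reading of Eq (18) reconstructed above is given below; the quadratic pair enumeration is chosen for clarity rather than efficiency.

import numpy as np
from itertools import combinations

def minkowski_score(true_labels, pred_labels) -> float:
    """Minkowski score of Eq (18): sqrt((n01 + n10) / (n11 + n10)), where the
    counts are over pairs of objects (same cluster in both solutions, only in the
    true solution, or only in the evaluated solution). Lower is better; 0 is optimal."""
    t = np.asarray(true_labels)
    c = np.asarray(pred_labels)
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(t)), 2):
        same_t, same_c = t[i] == t[j], c[i] == c[j]
        n11 += same_t and same_c
        n10 += same_t and not same_c
        n01 += same_c and not same_t
    return float(np.sqrt((n01 + n10) / (n11 + n10)))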

Experimental Study

This section presents the datasets used for the experiments, the measure used to evaluate clustering accuracy, and the parameter settings of the proposed algorithm.

Experimental datasets

The MOPSOSA algorithm is examined on 14 artificial and 5 real-life datasets (S1 File). Table 1 displays the types of datasets, the number of points (objects), the dimensions (features), and the number of clusters. Further details on these datasets are provided below.

Table 1. Description of the artificial and real-life datasets.

https://doi.org/10.1371/journal.pone.0130995.t001

  • Artificial datasets
    1. Sph_5_2 [2] dataset (Appendix A in S1 File): This dataset consists of 250-point 2D distributed over five overlapping spherically shaped clusters. Each cluster contains 50 points. Fig 4a illustrates this dataset.
    2. Sph_4_3 [2] dataset (Appendix B in S1 File): This dataset demonstrated in Fig 4b comprises 400-point 3D distributed over four disjointed hyper spherically shaped clusters. Each cluster contains 100 points.
    3. Sph_6_2 [2] dataset (Appendix C in S1 File): This dataset involves 300-point 2D distributed over six different clusters. Each cluster embodies 50 points. This dataset is depicted in Fig 4c.
    4. Sph_10_2 [30] dataset (Appendix D in S1 File): This dataset accommodates 500-point 2D distributed over 10 different clusters, of which some are overlapping. Each cluster holds 50 points. This dataset is shown in Fig 4d.
    5. Sph_9_2 [30] dataset (Appendix E in S1 File): Specified in Fig 4e, this dataset embodies 900-point 2D distributed over nine highly overlapping clusters, in which each cluster incorporates 100 points.
    6. Pat1 [31] dataset (Appendix F in S1 File): This dataset involves 557-point 2D distributed over three different clusters; one of these clusters is non-convex. This dataset is signified in Fig 4f.
    7. Pat2 [31] dataset (Appendix G in S1 File): This dataset contains 417-point 2D distributed over two nonlinear, non-symmetric, and non-overlapping clusters. Fig 4g shows this dataset.
    8. Long1 [3] dataset (Appendix H in S1 File): This dataset shown in Fig 4h encloses 1000-point 2D distributed over two long-shaped clusters.
    9. Sizes5 [3] dataset (Appendix I in S1 File): This dataset comprises 1000-point 2D distributed over four square-shaped clusters, one of which contains more points than the others. Fig 4i displays this dataset.
    10. Spiral [3] dataset (Appendix J in S1 File): This dataset exhibited in Fig 4j consists of 1000-point 2D distributed over two spiral-shaped clusters.
    11. Square1 [3] dataset (Appendix K in S1 File): This dataset includes 1000-point 2D distributed over four semi-overlapping square-shaped clusters. Each cluster contains 250 points. This dataset is shown in Fig 4k.
    12. Square4 [3] dataset (Appendix L in S1 File): Specified in Fig 4l, this dataset comprises 1000-point 2D distributed over four overlapping square-shaped clusters, each containing 250 points.
    13. Twenty [3] dataset (Appendix M in S1 File): This dataset incorporates 1000-point 2D distributed over 20 small clusters. Each cluster contains 50 points. This dataset is shown in Fig 4m.
    14. Fourty [3] dataset (Appendix N in S1 File): This dataset exhibited in Fig 4n consists of 1000-point 2D distributed over 40 small clusters. Each cluster contains 25 points.
  • Real-life datasets
    1. Iris [32] dataset (Appendix O in S1 File): This dataset comprises 150 four-feature samples distributed over three clusters each containing 50 observations. These samples are obtained from different categories of the iris flower (i.e., Setosa, Versicolor, and Virginica). Each sample has four feature values: sepal length, sepal width, petal length, and petal width. Two clusters of the iris flower (Versicolor and Virginica) are highly overlapping.
    2. Cancer [32] dataset (Appendix P in S1 File): This dataset consists of 683 samples with nine laboratory tests distributed over two clusters. Procured from Wisconsin Breast Cancer, these samples consist of two categories, malignant and benign, which are known to be linearly separable.
    3. Newthyroid [32] dataset (Appendix Q in S1 File): This dataset incorporates 215 instances with five laboratory tests distributed over three clusters. These samples are labeled as “Thyroid gland data,” which embody three categories (i.e., normal, hypo, and hyper).
    4. LiverDisorder [32] dataset (Appendix R in S1 File): This dataset represents 345 instances with six laboratory tests distributed over two clusters. The task is to determine whether a person suffers from alcoholism.
    5. Glass [32] dataset (Appendix S in S1 File): This dataset involves 214 samples with nine features distributed over six clusters. The field of criminological investigation has motivated the study of classifying glass types: glass left at a crime scene can provide evidence if it is correctly identified. In this dataset, the 10th feature (ID number) has been removed.
Fig 4. Graphs of the artificial datasets.

(a) Sph_5_2. (b) Sph_4_3. (c) Sph_6_2. (d) Sph_10_2. (e) Sph_9_2. (f) Pat1. (g) Pat2. (h) Long1. (i) Sizes5. (j) Spiral. (k) Square1. (l) Square4. (m) Twenty. (n) Fourty.

https://doi.org/10.1371/journal.pone.0130995.g004

Evaluating the clustering quality

An external criterion for evaluating the clustering quality of the results is presented in this section. The F-measure [33] is selected to evaluate the final solutions obtained from the MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, VGAPS, KM, and SL clustering algorithms. Let T = {T1,…,TkT} be the true solution and C = {C1,…,CkC} be the solution to be evaluated, where kT and kC are the numbers of clusters of T and C, respectively. The F-measure of class Ti and cluster Cj is defined as

F(Ti,Cj) = 2 P(Ti,Cj) R(Ti,Cj) / (P(Ti,Cj) + R(Ti,Cj)), (19)

where P(Ti,Cj) = nij/|Cj|, R(Ti,Cj) = nij/|Ti|, and nij is the number of objects common to Ti and Cj. The F-measure of the solutions T and C is then

F(T,C) = Σi=1..kT (|Ti|/n) maxj F(Ti,Cj), (20)

where n is the number of objects in the dataset. Higher values of F(T,C) are better, and the optimal value of F(T,C) is 1.
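
The following sketch computes the F-measure of Eqs (19) and (20), as reconstructed above, from two label vectors.

import numpy as np

def f_measure(true_labels, pred_labels) -> float:
    """F-measure of Eqs (19)-(20): for each true class T_i take the best
    F(T_i, C_j) over the clusters C_j, then weight by |T_i| / n. Optimal value is 1."""
    t = np.asarray(true_labels)
    c = np.asarray(pred_labels)
    n = len(t)
    total = 0.0
    for ti in np.unique(t):
        best = 0.0
        for cj in np.unique(c):
            n_ij = np.sum((t == ti) & (c == cj))
            if n_ij == 0:
                continue
            p = n_ij / np.sum(c == cj)        # precision P(T_i, C_j)
            r = n_ij / np.sum(t == ti)        # recall    R(T_i, C_j)
            best = max(best, 2 * p * r / (p + r))
        total += np.sum(t == ti) / n * best
    return float(total)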

Parameter settings

Table 2 presents the parameter settings employed in the proposed MOPSOSA algorithm. The performance of this algorithm is compared with three multi-objective automatic and three single-objective clustering algorithms (i.e., GenClustMOO, GenClustPESA2, MOCK, VGAPS, KM, and SL). These algorithms and the proposed algorithm are executed on all the above-mentioned datasets. Employing the semi-supervised method [20], the GenClustMOO and GenClustPESA2 algorithms select the best solutions from the final Pareto set. Additional details on the standard parameters employed in these algorithms can be found in Saha and Bandyopadhyay [8]. In the MOCK algorithm, the GAP statistic [34] is used to select the best solution. The source code with the standard parameters used in MOCK is available in [3]. The VGAPS, KM, and SL clustering algorithms provide a single solution. In VGAPS, the population size is 100, the number of generations is 60, and the mutation and crossover probabilities are computed adaptively. The total number of computations implemented in the proposed algorithm, GenClustMOO, GenClustPESA2, MOCK, and VGAPS, as well as the number of iterations in KM and SL, are all equal. Each algorithm is run 30 times.

Results and Discussions

For each algorithm, the average value of F-measure is calculated for the final best solution to compare and exhibit the performance of the proposed algorithm with that of other algorithms. More information about the results of the cluster number and F-measure values of GenClustMOO, GenClustPESA2, MOCK, VGAPS, KM, and SL on the specified datasets can be acquired from Saha and Bandyopadhyay [8]. Table 3 displays the best value of F-measure and the number of clusters for the datasets automatically obtained with MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, and VGAPS automatic clustering techniques. KM and SL are implemented with the actual number of clusters on all datasets.

Table 3. F-measure value and the number of clusters for different datasets obtained by MOPSOSA compared with those acquired by GenClustMOO, GenClustPESA2, MOCK, and VGAPS algorithms.

https://doi.org/10.1371/journal.pone.0130995.t003

Discussion of the artificial datasets results

  1. Sph_5_2: Table 4 shows that the maximum F-measure value for this dataset was obtained with the MOPSOSA algorithm even though the dataset contains five overlapping spherical clusters. Moreover, MOPSOSA, GenClustMOO, GenClustPESA2, and VGAPS established the actual number of clusters, as illustrated in Table 3. Fig 5a shows the clustering of this dataset after the MOPSOSA algorithm was applied.
  2. Sph_4_3: The actual number for this dataset was detected with the MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, and VGAPS clustering algorithms. All seven algorithms also achieved an F-measure value of 1, providing 100% accuracy for the clustering of this dataset (refer to Tables 3 and 4). Fig 5b exhibits the graph of clusters Sph_4_3 after the MOPSOSA algorithm was employed.
  3. Sph_6_2: The F-measure value for this dataset was determined to be 1 for the seven algorithms (Table 4), signifying the accurate performance of all algorithms. Moreover, all algorithms attained the real number of clusters as demonstrated in Table 3. Fig 5c depicts the graph of the clusters for this dataset with the application of the MOPSOSA algorithm.
  4. Sph_10_2: Table 3 reveals that only the MOPSOSA and GenClustMOO clustering algorithms achieved the desired number of clusters for this dataset. However, a maximum F-measure value was obtained with MOPSOSA (refer to Table 4) despite some overlap in these datasets. Fig 5d shows the graph for the clustering of Sph_10_2 with the post-application of the MOPSOSA algorithm.
  5. Sph_9_2: For this dataset, Table 3 shows that MOPSOSA, GenClustMOO, MOCK, and VGAPS, but not GenClustPESA2, detected the actual number of clusters. Despite the overlaps among all clusters in this dataset, MOPSOSA obtained the maximum F-measure value, demonstrating the accuracy of its performance (refer to Table 4). Fig 5e illustrates the clustering of this dataset with the MOPSOSA algorithm.
  6. Pat1: Table 4 demonstrates that only MOPSOSA achieved the maximum F-measure value for this dataset, indicating highly accurate clustering for well-separated clusters and for clusters of various shapes. Nevertheless, the MOPSOSA, GenClustMOO, and GenClustPESA2 clustering algorithms attained the real number of clusters (Table 3), whereas MOCK was found to be unsuitable for this dataset. The three clusters are clearly depicted in Fig 5f after the algorithm was applied to this dataset.
  7. Pat2: Tables 3 and 4 show that the MOPSOSA, GenClustMOO, and GenClustPESA2 clustering algorithms obtained the real number of clusters for this dataset with an F-measure value of 1, signifying the high accuracy of these algorithms in clustering this nonlinear, non-spherical dataset. Fig 5g reveals the graph of the two clusters in Pat2 with the application of the MOPSOSA algorithm.
  8. Long1: For this dataset, MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, and SL acquired the F-measure value of 1. Meanwhile, MOPSOSA, GenClustMOO, GenClustPESA2, and MOCK automatically resolved the proper cluster numbers for this dataset (refer to Tables 3 and 4). Fig 5h presents the clustering of this dataset into two correct clusters with the application of the MOPSOSA algorithm.
  9. Sizes5: Table 4 reveals that the maximum F-measure value for this dataset was obtained with the MOPSOSA algorithm, which indicates that the proposed algorithm is capable of clustering a dataset with clusters of different sizes. Moreover, Table 3 specifies that both MOPSOSA and GenClustMOO identified the actual number of clusters. Fig 5i shows the result of clustering this dataset with the application of the MOPSOSA algorithm.
  10. Spiral: Table 4 indicates that an F-measure value of 1 was acquired by MOPSOSA, GenClustMOO, and GenClustPESA2 for this dataset, indicating 100% accurate clustering on the spiral shapes. MOPSOSA, GenClustMOO, and GenClustPESA2 clustering algorithms also determined the real number of clusters as shown in Table 3. Fig 5j is a clear graphic illustration of the two spirals for this dataset with the application of the MOPSOSA algorithm.
  11. Square1: For this dataset, all five automatic clustering algorithms (MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, and VGAPS) detected the appropriate number of clusters (refer to Tables 3 and 4) and obtained the maximum F-measure value, thereby indicating their high accuracy in clustering this dataset. Fig 5k illustrates the result of clustering Square1 into four clusters by applying the MOPSOSA algorithm.
  12. Square4: Table 3 shows that, for this dataset, MOPSOSA, GenClustMOO, GenClustPESA2, and MOCK, but not VGAPS, established the actual number of clusters, with the maximum F-measure value obtained via MOPSOSA (see Table 4). The proposed algorithm was capable of clustering this dataset with high accuracy even though it contains four overlapping clusters. The graph for the clustering of this dataset using the MOPSOSA algorithm is depicted in Fig 5l.
  13. Twenty: For this dataset, MOPSOSA, GenClustMOO, MOCK, and VGAPS, but not GenClustPESA2, determined the real number of clusters (see Table 3). However, only MOPSOSA, GenClustMOO, and MOCK obtained an F-measure value of 1, demonstrating extremely high clustering accuracy even with many clusters (refer to Table 4). The clusters for this dataset after the application of the MOPSOSA algorithm are graphically shown in Fig 5m.
  14. Fourty: Table 3 reveals that for this dataset, only three automatic clustering algorithms (MOPSOSA, GenClustMOO, and MOCK) identified the desired cluster number. All these algorithms also obtained the F-measure value of 1, demonstrating an exceedingly high clustering accuracy despite the large number of clusters (refer to Table 4). Fig 5n depicts the graph for clustering this dataset after the application of the MOPSOSA algorithm.
Table 4. Averages and standard deviations for the F-measure values on the different datasets obtained from MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, VGAPS, KM, and SL algorithms.

https://doi.org/10.1371/journal.pone.0130995.t004

Fig 5. Graphs of the artificial datasets after applying the MOPSOSA algorithm.

(a) Sph_5_2. (b) Sph_4_3. (c) Sph_6_2. (d) Sph_10_2. (e) Sph_9_2. (f) Pat1. (g) Pat2. (h) Long1. (i) Sizes5. (j) Spiral. (k) Square1. (l) Square4. (m) Twenty. (n) Fourty.

https://doi.org/10.1371/journal.pone.0130995.g005

Discussion of the real-life datasets results

  1. Iris: Table 4 shows that for this dataset, the maximum F-measure value was obtained with the proposed algorithm MOPSOSA. However, with the exception of MOCK, all four automatic clustering algorithms (MOPSOSA, GenClustMOO, GenClustPESA2, and VGAPS) resolved the proper number of clusters, as evidenced in Table 3.
  2. Cancer: The maximum F-measure value for this dataset was obtained with the proposed MOPSOSA algorithm (see Table 4). Nevertheless, all five automatic clustering algorithms (MOPSOSA, GenClustMOO, GenClustPESA2, MOCK, and VGAPS) identified the correct number of clusters, as illustrated in Table 3.
  3. Newthyroid: Table 4 reveals that the maximum F-measure value for this dataset was attained with the MOPSOSA algorithm. However, Table 3 specifies that only two automatic clustering algorithms (MOPSOSA and GenClustMOO) determined the actual number of clusters.
  4. Liver Disorder: For this dataset, MOPSOSA, GenClustMOO, MOCK, and VGAPS, but not GenClustPESA2, identified the actual number of clusters (refer to Table 3). Meanwhile, the maximum F-measure was achieved with the proposed algorithm MOPSOSA (refer to Table 4).
  5. Glass: Table 4 demonstrates that the maximum F-measure value for this dataset was obtained with the MOPSOSA algorithm. Only MOPSOSA and GenClustMOO automatic clustering algorithms were determined to be capable of achieving the desired number of clusters (see Table 3).

Summary of results

The above results indicate that the proposed MOPSOSA algorithm achieves accurate results on all datasets. Moreover, the proposed algorithm can automatically establish the correct cluster numbers for all datasets used in the experiment. The algorithm is also proven capable of dealing with various shapes of datasets (hyperspherical, linear, and spiral), overlapping datasets, datasets that have well-separated clusters with convex and non-convex shapes, and datasets that contain many clusters. With most datasets having dimensions from 2 to 9, objects from 150 to 1000, and numbers of clusters from 2 to 40, the MOPSOSA algorithm displays superiority over the three multi-objective automatic and three single-objective clustering algorithms. The results also show that the GenClustMOO algorithm can automatically identify the actual cluster numbers, but with lower clustering accuracy than the proposed algorithm. In general, MOCK can detect the number of clusters for hyperspherical and linear clusters, but it is unsuccessful for non-convex well-separated and overlapping clusters. The results also show that the VGAPS algorithm is not suitable for non-convex well-separated clusters or for datasets with numerous clusters.

The main factors behind the accuracy of the proposed algorithm in solving the clustering problem are the power and speed of the search provided by the particle swarm, together with the guarantee, via the MOSA technique, that particles do not stagnate in local solutions. Extending the particle swarm to optimize more than one validity index enables the algorithm to cluster datasets of different types. The generation of the initial swarm of particles is improved with the KM method. Meanwhile, the repository is updated using fitness sharing to preserve the diversity of clustering solutions, and redundant particles are eliminated through the re-numbering process.

Conclusion

This research proposed a new automatic multi-objective clustering algorithm, MOPSOSA, based on a hybrid of a multi-objective particle swarm algorithm and multi-objective simulated annealing. A multi-objective particle swarm optimization was developed from combinatorial particle swarm optimization. The proposed algorithm was shown to be capable of automatically clustering a dataset into the appropriate number of clusters. With the simultaneous optimization of three objective functions, the Pareto optimal set was obtained from the proposed algorithm. The first objective function considered the compactness of the clustering based on Euclidean distance, the second the total symmetry of the clusters, and the third the connectedness of the clusters. The proposed algorithm was tested on 19 artificial and real-life datasets, and its performance was compared with that of three multi-objective automatic and three single-objective clustering techniques. MOPSOSA obtained better accuracy than the other algorithms. The results also demonstrated that the proposed algorithm can be used for datasets of various shapes and for overlapping and non-convex datasets.

Supporting Information

S1 File. Experimental datasets.

250 points of the artificial datasets Sph_5_2 (Appendix A). 400 points of the artificial datasets Sph_4_3 (Appendix B). 300 points of the artificial datasets Sph_6_2 (Appendix C). 500 points of the artificial datasets Sph_10_2 (Appendix D). 900 points of the artificial datasets Sph_9_2 (Appendix E). 557 points of the artificial datasets Pat1 (Appendix F). 417 points of the artificial datasets Pat2 (Appendix G). 1000 points of the artificial datasets Long1 (Appendix H). 1000 points of the artificial datasets Sizes5 (Appendix I). 1000 points of the artificial datasets Spiral (Appendix J). 1000 points of the artificial datasets Square1 (Appendix K). 1000 points of the artificial datasets Square4 (Appendix L). 1000 points of the artificial datasets Twenty (Appendix M). 1000 points of the artificial datasets Fourty (Appendix N). 150 samples of the real-life datasets Iris (Appendix O). 683 samples of the real-life datasets Cancer (Appendix P). 215 instances of the real-life datasets Newthyroid (Appendix Q). 345 instances of the real-life datasets LiverDisorder (Appendix R). 214 samples of the real-life datasets Glass (Appendix S).

https://doi.org/10.1371/journal.pone.0130995.s001

(PDF)

Author Contributions

Conceived and designed the experiments: AA. Performed the experiments: AA. Analyzed the data: AA. Contributed reagents/materials/analysis tools: AA. Wrote the paper: AA AB MA.

References

  1. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs: Prentice Hall; 1988.
  2. Bandyopadhyay S, Maulik U. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognition. 2002;35(6):1197–208.
  3. Handl J, Knowles J. An evolutionary approach to multiobjective clustering. IEEE Transactions on Evolutionary Computation. 2007;11(1):56–76.
  4. Bandyopadhyay S, Saha S. A point symmetry-based clustering technique for automatic evolution of clusters. IEEE Transactions on Knowledge and Data Engineering. 2008;20(11):1441–57.
  5. Saha S, Bandyopadhyay S. A symmetry based multiobjective clustering technique for automatic evolution of clusters. Pattern Recognition. 2010;43(3):738–51.
  6. Liu Y, Wu X, Shen Y. Automatic clustering using genetic algorithms. Applied Mathematics and Computation. 2011;218(4):1267–79.
  7. Masoud H, Jalili S, Hasheminejad SMH. Dynamic clustering using combinatorial particle swarm optimization. Applied Intelligence. 2013;38(3):289–314.
  8. Saha S, Bandyopadhyay S. A generalized automatic clustering algorithm in a multiobjective framework. Applied Soft Computing. 2013;13(1):89–108.
  9. Halkidi M, Vazirgiannis M. Clustering validity assessment: finding the optimal partitioning of a data set. In: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 01); 2001; California, USA.
  10. Suresh K, Kundu D, Ghosh S, Das S, Abraham A, Han SY. Multi-objective differential evolution for automatic clustering with application to micro-array data analysis. Sensors. 2009;9(5):3981–4004. pmid:22412346
  11. Liu Y, Özyer T, Alhajj R, Barker K. Integrating multi-objective genetic algorithm and validity analysis for locating and ranking alternative clustering. Informatica (Slovenia). 2005;29(1):33–40.
  12. Matake N, Hiroyasu T, Miki M, Senda T. Multiobjective clustering with automatic k-determination for large-scale data. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2007); 2007; London, England.
  13. Bandyopadhyay S, Saha S. GAPS: A clustering method using a new point symmetry-based distance measure. Pattern Recognition. 2007;40(12):3430–51.
  14. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967; Oakland, CA, USA.
  15. Everitt BS, Landau S, Leese M. Cluster Analysis. London: Hodder Arnold; 2001.
  16. Davies DL, Bouldin DW. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;1(2):224–7. pmid:21868852
  17. Saha S, Bandyopadhyay S. Some connectivity based cluster validity indices. Applied Soft Computing. 2012;12(5):1555–65.
  18. Ulungu B, Teghem J, Fortemps P. Heuristic for multi-objective combinatorial optimization problems by simulated annealing. In: Gu J, Chen G, Wei Q, Wang S, editors. MCDM: Theory and Applications. Sci-Tech; 1995. p. 229–38.
  19. Goldberg DE, Richardson J. Genetic algorithms with sharing for multimodal function optimization. In: Proceedings of the Second International Conference on Genetic Algorithms and their Applications; 1987; Cambridge, MA, USA.
  20. Saha S, Bandyopadhyay S. A new multiobjective simulated annealing based clustering technique using symmetry. Pattern Recognition Letters. 2009;30(15):1392–403.
  21. Shieh H-L, Kuo C-C, Chiang C-M. Modified particle swarm optimization algorithm with simulated annealing behavior and its numerical verification. Applied Mathematics and Computation. 2011;218(8):4365–83.
  22. Mitra D, Romeo F, Sangiovanni-Vincentelli A. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability. 1986;18(3):747–71.
  23. Hruschka ER, Campello RJGB, Freitas AA, De Carvalho APLF. A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 2009;39(2):133–55.
  24. Yang SL, Li YS, Hu XX, Pan RY. Optimization study on k value of K-means algorithm. Systems Engineering-Theory & Practice. 2006;2:97–101.
  25. Toussaint GT. The relative neighbourhood graph of a finite planar set. Pattern Recognition. 1980;12(4):261–8.
  26. Salazar-Lechuga M, Rowe JE. Particle swarm optimization and fitness sharing to solve multi-objective optimization problems. In: Congress on Evolutionary Computation (CEC 2005); 2005; Edinburgh, Scotland, UK.
  27. Jaccard P. Lois de distribution florale dans la zone alpine. Bull Soc Vaudoise Sci Nat. 1902;38:69–130.
  28. Deb K. Multi-objective optimization using evolutionary algorithms. New York: Wiley; 2001.
  29. Jardine N, Sibson R. Mathematical taxonomy. New York: Wiley; 1971.
  30. Bandyopadhyay S, Pal SK. Classification and learning using genetic algorithms: applications in bioinformatics and web intelligence. Springer Science & Business Media; 2007.
  31. Pal SK, Mitra S. Fuzzy versions of Kohonen's net and MLP-based classification: performance evaluation for certain nonconvex decision regions. Information Sciences. 1994;76(3):297–337.
  32. Lichman M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; 2013.
  33. Fung BC, Wang K, Ester M. Hierarchical document clustering using frequent itemsets. In: Proceedings of the 3rd SIAM International Conference on Data Mining (SDM); 2003; San Francisco, CA.
  34. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B. 2001;63(2):411–23.