A Comparison of Heuristics with Modularity Maximization Objective using Biological Data Sets

Finding groups of objects exhibiting similar patterns is an important data analytics task. Many disciplines have their own terminology, such as cluster, group, clique, or community, for a set of similar objects. Adopting the term community, many exact and heuristic algorithms have been developed to find the communities of interest in available data sets. Here, three heuristic algorithms for finding communities are compared using five gene expression data sets. The heuristics share the objective of maximizing modularity, a quality measure of a partition that reflects the objects' relevance in communities. Partitions generated by the heuristics are compared with the real ones using the adjusted Rand index, one of the most commonly used external validation measures. The paper discusses the results of the partitions on the mentioned biological data sets.


Introduction
The clustering problem has attracted the attention of researchers for decades. The problem arises in almost every discipline, and the solutions that address it are applicable across disciplines such as engineering and the natural and social sciences. While there are general clustering algorithms, it is also desirable to design clustering approaches for specific problems.
In this regard, the optimization discipline offers unique tools for clustering. The clustering problem can be viewed as an optimization problem with a specific objective function to optimize and some constraints to satisfy.
Mathematical programming approaches for clustering exist. However, the mathematical models are practical and applicable only to small-scale data. Hence, heuristic algorithms are required to solve large-scale optimization problems.
Clustering is an unsupervised learning task encountered in various data mining applications. The applications span many fields, including physics, astronomy, and bioinformatics, and employ a plethora of algorithms.
Different disciplines favor distinct terms for a set of similar objects, namely cluster, clique, group, or community. Although there is no universal definition of the best community, researchers agree that the objects in a community must exhibit similar patterns or must be strongly connected based on a defined relationship. In other words, the similarity of objects within a cluster should be maximized, and the similarity of objects between clusters should be minimized.
Bioinformatics is a still-developing field that addresses life sciences problems using computational techniques.
Clustering approaches are employed for data analysis in bioinformatics as well. Analysis of microarray gene coexpression data is one of the bioinformatics problems where clustering is utilized.
Clustering gene expression data is an integrated task that comprises low-level and high-level analysis. The three main steps of cluster analysis for gene expression data are as follows: 1. data pre-processing: preparing the data so that the clustering algorithm can use it as input; 2. employing a clustering algorithm with an appropriate distance measure (if necessary); and 3. using a validation measure (internal and/or external) to validate the quality of the clusters found. Given the importance of analyzing high-dimensional gene expression data sets, heuristics are promising approaches for clustering high-throughput data such as those generated by microarrays. Microarrays measure the expression levels of tens of thousands of genes simultaneously on a single chip. Measurements involve relative expression values of each gene obtained through an image processing task.
There is no single best clustering approach for the problem at hand, and clustering algorithms are biased towards certain criteria. In other words, a particular clustering approach has its own objective and assumptions about the data.
For example, the K-means algorithm is sensitive to noise, which is inherent in gene expression data. In addition, the solution (i.e., the final clustering) that the K-means algorithm finds may not be a global optimum, since it relies on randomly chosen initial objects. Hierarchical clustering algorithms are "greedy", which often means that the final solution is suboptimal: locally optimal choices made in the initial steps turn out to be poor choices with respect to the global solution.
Here, three heuristic community structure finding algorithms are employed on five different gene expression data sets. The algorithms share the objective of maximizing a community-defining measure called modularity. Higher modularity values indicate better clustering.
The maximum modularity values generated by the algorithms on the data sets are reported. The partitions produced by the algorithms are evaluated by comparing them with the real partitions through a widely used external validation index, the adjusted Rand index.
The paper is organized as follows: section two describes the community structure finding problem using modularity, section three presents the algorithms and the data sets as well as the results of the study, and section four is the conclusion.

Community Structure Finding
There are many community structure finding (also referred to as pattern recognition or clustering) algorithms that use modularity maximization on a given network G = (V,E) with m = |E| edges and n = |V| nodes. From an optimization perspective, the problem of finding the best (optimal) community structure can be modeled as an integer linear programming (ILP) problem. In the corresponding ILP, $x_{uv}$ are binary variables equal to 1 if there is a connection between nodes u and v (that is, if they are placed in the same community) and 0 otherwise. The first set of constraints are reflexivity constraints. The second set are symmetry constraints, meaning that if object u is connected to object v, then object v is connected to object u. The third, fourth, and fifth sets are transitivity constraints, meaning that when object u is connected to object v and object v is connected to object w, then object u is also connected to object w. Once the redundancies in the variables and constraints are removed, the numbers of variables and constraints are of order $\binom{n}{2}$ and $\binom{n}{3}$, respectively.
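One formulation consistent with the description above can be written as follows. This is a sketch in the standard clique-partitioning form, not a transcription of the original model; the symbols $a_{uv}$ (1 if nodes u and v are adjacent in G, 0 otherwise) and $d_u$ (the degree of node u) are notation assumed here.

```latex
\begin{align*}
\max\quad & \frac{1}{2m}\sum_{u \in V}\sum_{v \in V}
            \left(a_{uv} - \frac{d_u d_v}{2m}\right) x_{uv} \\
\text{s.t.}\quad
 & x_{uu} = 1                      && \forall u \in V       && \text{(reflexivity)} \\
 & x_{uv} = x_{vu}                 && \forall u, v \in V    && \text{(symmetry)} \\
 & x_{uv} + x_{vw} - x_{uw} \le 1  && \forall u, v, w \in V && \text{(transitivity)} \\
 & x_{uv} - x_{vw} + x_{uw} \le 1  && \forall u, v, w \in V && \text{(transitivity)} \\
 & -x_{uv} + x_{vw} + x_{uw} \le 1 && \forall u, v, w \in V && \text{(transitivity)} \\
 & x_{uv} \in \{0, 1\}             && \forall u, v \in V
\end{align*}
```

The objective is the modularity of the partition encoded by the $x_{uv}$ values, so an optimal solution of this program is a maximum-modularity partition.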
In their paper, Brandes et al. [1] prove the problem to be NP-complete. Hence, heuristic algorithms are required to generate reasonable results close to the global optimum. Newman and Girvan [2] propose one of the most cited community structure finding algorithms. The algorithm finds communities by removing the edges with the highest betweenness from the graph constructed from a data set. One way to define the betweenness of an edge is to count the number of shortest paths passing along it. The algorithm's worst-case time complexity is $O(m^2 n)$.
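The shortest-path definition of edge betweenness can be sketched in a few lines. The following is a minimal illustration, not the igraph implementation used in this study. It assumes an unweighted, undirected graph given as a dictionary of neighbour sets, and it counts shortest paths with a breadth-first search followed by Brandes-style back-propagation, which is a choice made for this sketch. The function name `edge_betweenness` is hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge.

    adj maps each node to the set of its neighbours (undirected, unweighted).
    Returns a dict keyed by frozenset({u, v}).
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search: distances, shortest-path counts, predecessors.
        dist = {source: 0}
        n_paths = {v: 0 for v in adj}
        n_paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing path counts among edges.
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = n_paths[v] / n_paths[w] * (1.0 + dependency[w])
                score[frozenset((v, w))] += share
                dependency[v] += share
    # Every node pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}
```

In a betweenness-based community search, the edge with the largest score would be removed and the scores recomputed, repeating until the graph splits into the desired communities.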
Pons and Latapy [3] compute communities using random walks. The algorithm, called walktrap, has a worst-case time complexity of $O(mn^2)$. Clauset and Newman [4] propose a fast greedy community structure finding algorithm based on hierarchical agglomeration. The algorithm's worst-case time complexity is $O(md \log n)$, where d is the depth of the dendrogram describing the community structure. When the network is sparse and the dendrogram is balanced, the algorithm runs in $O(n \log^2 n)$ time.
Clique percolation [5] and label propagation [6] are among the more recent methods for finding community structures. Fortunato [7] presents a recent review of algorithmic methods for detecting community structure in networks.
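The idea of greedy hierarchical agglomeration can be illustrated with a deliberately naive sketch: start from singleton communities and repeatedly merge the connected pair with the largest modularity gain, remembering the best partition seen. This is not the algorithm of [4] as implemented in igraph. It scans all community pairs at every step and therefore does not achieve the running times quoted above. The function name `greedy_modularity` and the dictionary-of-sets graph representation are assumptions of this example.

```python
from itertools import combinations


def greedy_modularity(adj):
    """Naive greedy agglomeration on an undirected, unweighted graph.

    adj maps each node to the set of its neighbours.
    Returns (best modularity, list of communities as sets).
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())   # 2 * number of edges
    comms = {u: {u} for u in adj}                     # community id -> members
    deg = {u: len(adj[u]) for u in adj}               # community id -> degree sum
    q = -sum((d / two_m) ** 2 for d in deg.values())  # modularity of singletons
    best_q, best = q, [set(c) for c in comms.values()]
    while len(comms) > 1:
        best_gain, pair = None, None
        for i, j in combinations(comms, 2):
            links = sum(1 for u in comms[i] for v in adj[u] if v in comms[j])
            if links == 0:
                continue  # only communities joined by an edge are merged
            # Modularity change when communities i and j are merged.
            gain = 2 * (links / two_m - deg[i] * deg[j] / two_m ** 2)
            if best_gain is None or gain > best_gain:
                best_gain, pair = gain, (i, j)
        if pair is None:
            break  # the remaining communities are mutually disconnected
        i, j = pair
        comms[i] |= comms.pop(j)
        deg[i] += deg.pop(j)
        q += best_gain
        if q > best_q:
            best_q, best = q, [set(c) for c in comms.values()]
    return best_q, best
```

Merging continues even when the best gain is negative, so the full dendrogram is built and the partition with the highest modularity along the way is returned.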

Methods and Results
Three community structure finding algorithms, namely betweenness [2], walktrap [3], and greedy [4], are compared using five gene expression data sets. These algorithms were selected for their ease of implementation and their prevalence. The R implementations in the igraph library [8] are employed.
The data sets are summarized in Table 1. BreastA is a two-channel oligonucleotide microarray data set. BreastB is a one-channel microarray data set. Both are cancer diagnosis data sets. DLBCLA is a diffuse large B-cell lymphoma data set. These three data sets are published in [9]. The Leukemia data set was obtained online at http://www.broadinstitute.org/cgibin/cancer/datasets.cgi. The CNS data set contains observations of 112 rat genes at nine time points. The data set is addressed in [10].
Each data set is represented as a complete network in which the nodes represent the samples (tissues) or, for the CNS data only, the genes, and the edges represent relationships whose strengths are Pearson correlation values. The networks are then trimmed by removing the edges with the lowest correlation values until the networks become disconnected. The corresponding threshold values for the data sets, in the order shown in Table 1, are 0.293, 0.545, 0.839, 0.225, and 0.610. The community structure finding algorithms use the trimmed networks to generate partitions and modularity values. The partitions are used to calculate the adjusted Rand index [11] values. The application workflow is shown in Figure 1.
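The network construction and trimming step can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to produce the reported thresholds. It uses raw Pearson correlations as edge strengths, stops just before the removal that would disconnect the network, and reports the weakest retained correlation as the threshold. The names `pearson`, `is_connected`, and `trim_network` are hypothetical. Profiles are assumed to be non-constant, and at least two nodes are required.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def is_connected(nodes, edges):
    """Depth-first search check that every node is reachable from the first."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == len(nodes)


def trim_network(profiles):
    """Build the complete correlation network and drop its weakest edges.

    profiles maps a node name to its list of expression values.
    Returns (retained edges, weakest retained correlation).
    """
    nodes = list(profiles)
    edges = [(pearson(profiles[u], profiles[v]), u, v)
             for i, u in enumerate(nodes) for v in nodes[i + 1:]]
    edges.sort(key=lambda edge: edge[0])  # weakest correlation first
    # Remove the weakest edge while the remaining network stays connected.
    while len(edges) > 1 and is_connected(
            nodes, [(u, v) for _, u, v in edges[1:]]):
        edges.pop(0)
    threshold = edges[0][0]
    return [(u, v) for _, u, v in edges], threshold
```

The retained edge list can then be passed to a community structure finding algorithm.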
The results shown in Table 2 are the maximum modularity values corresponding to the partitions generated by the algorithms. mod1, mod2, and mod3 are the maximum modularity values obtained by the betweenness, walktrap, and greedy algorithms, respectively. The values in bold are the maximum of the maximum modularity values found by the algorithms.
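For reference, the modularity of a given partition can be computed directly from its definition: for each community, the fraction of edges that fall inside it minus the squared fraction of edge ends attached to it. The sketch below assumes an unweighted, undirected graph stored as a dictionary of neighbour sets; the function name `modularity` is hypothetical, and the code is not the igraph routine used to produce the reported values.

```python
def modularity(adj, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    adj maps each node to the set of its neighbours;
    communities is an iterable of disjoint node sets covering the graph.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both of its endpoints.
        internal = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree_sum = sum(len(adj[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q
```

Placing all nodes in a single community gives a modularity of zero, so positive values indicate more within-community edges than expected by chance.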
The results shown in Table 3 are the adjusted Rand index values corresponding to the partitions generated by the algorithms. rand1, rand2, and rand3 are the adjusted Rand index values obtained by the betweenness, walktrap, and greedy algorithms, respectively. The values in bold are the highest among the values found by the algorithms.
Adjusted Rand index (ARI) values are computed in R using the clues [12] package. The adjusted Rand index values are calculated from the partition generated by a clustering algorithm (P1) and the real partition (P2). The ARI(P1,P2) formulation is as follows:

\[
\mathrm{ARI}(P_1, P_2) =
\frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}
     {\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}
\]

Here $n_{ij}$ is the number of objects common to clusters i and j, where i is the cluster index of the first partition and j is the cluster index of the second partition. $n_{i\cdot}$ represents the number of objects in cluster i, $n_{\cdot j}$ the number of objects in cluster j, and n the total number of objects. Higher adjusted Rand index values indicate better clusters. In other words, the higher the index value, the closer the partition is to the real partition. ARI values lie between -1 and 1.

DOI: 10.1051/ © Owned by the authors, published by EDP Sciences, 201
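The adjusted Rand index computation described above can be sketched in plain Python from the counts of shared objects. This is an illustrative sketch, not the clues implementation used in the study. The function name `adjusted_rand_index` is hypothetical, partitions are assumed to be given as equal-length label sequences with at least two objects, and returning 1.0 when both partitions are trivial and identical is a convention chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels1, labels2):
    """Adjusted Rand index of two partitions given as label sequences."""
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))  # n_ij: objects shared by i and j
    rows = Counter(labels1)                 # n_i.: sizes in the first partition
    cols = Counter(labels2)                 # n_.j: sizes in the second partition
    sum_ij = sum(comb(count, 2) for count in joint.values())
    sum_i = sum(comb(count, 2) for count in rows.values())
    sum_j = sum(comb(count, 2) for count in cols.values())
    expected = sum_i * sum_j / comb(n, 2)   # value expected by chance
    maximum = (sum_i + sum_j) / 2
    if maximum == expected:
        # Both partitions are all-in-one or all-singletons, hence identical.
        return 1.0
    return (sum_ij - expected) / (maximum - expected)
```

The index depends only on which objects are grouped together, so relabeling the clusters of either partition does not change the result.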

Figure 2 illustrates the partition obtained by the walktrap algorithm for the Leukemia data set. The shaded regions (in different colors) indicate the clusters found by the algorithm.

Figure 2. The partition obtained by the walktrap algorithm for the Leukemia data set.