Correlation Network Analysis of Biological Data

Zhaolong Yu
Fudan University, Shanghai, 200433

Abstract

Correlation network analysis has been widely used to find clusters or modules in complex networks, especially biological networks and stock networks. On the basis of correlations between quantitative measurements, weighted correlation network analysis can be used to identify modules formed by highly correlated elements, such as genes or proteins in biological networks. With the help of this method, we are able to explore the system-level functionality of certain genes. In this article, we apply weighted correlation network analysis to investigate gene co-expression networks in the context of the transcriptional response of cells to changing conditions.

Introduction

Networks provide a straightforward representation of interactions between different elements in a system, which enables us to gain insights into the dynamics of complex systems under various conditions. In the past ten years, network-based methods have been found useful in many domains, including the analysis of social, physical and biological systems. For example, in social networks, network-based methods can help us predict potential links between two people, detect highly connected communities and identify the most influential "superstars".

In the case of biological networks, with the rapid development of biomedical science, more and more biological networks have been identified, such as gene co-expression networks, protein-protein interaction networks and cell-cell interaction networks. Previous research on various biological processes fell short of accurately quantifying the biological molecules and tracking the biological reactions from a systematic view.
Simply measuring the expression of a few genes and investigating the molecular mechanisms in one or two pathways does not necessarily help explain complex biological processes during which thousands of reactions are ongoing at the same time. Given that every cell contains a gene regulation network in which around 20,000 genes and millions of RNAs and proteins interact with each other and reach a balance, network analyses have made it possible to take different kinds of biological components into consideration and to look more deeply into the underlying mechanisms of gene expression and regulation.

In many real networks, the probability p(k) that a node is connected to k other nodes decays as a power law. Many biological networks follow the same structure
where the topology is dominated by a few highly connected nodes (hubs) which link the rest of the less connected nodes. For example, analysis of the protein-protein interaction network revealed that highly connected nodes are more likely to be essential for survival, namely housekeeping genes or proteins.

To better understand biological networks, one of the most important tasks is to figure out the relationships between different components inside the cell. Correlation network analysis turns out to be an effective method for measuring this kind of relationship and detecting functional clusters. Correlation networks are constructed on the basis of correlations between quantitative measurements that can be described by an n × m matrix X, where the row indices (i = 1, 2, ..., n) correspond to network nodes and the column indices (l = 1, 2, ..., m) correspond to sample measurements.

The main rationale behind correlation network methodology is to use network language to find clusters (modules) of interconnected nodes, that is, sets of nodes closely connected according to a suitably defined measure of interconnectedness (correlation). The second use of correlation networks is to identify significant modules among all the modules computed by the analysis pipeline. By virtue of a node significance measure, modules with high average node significance are identified as significant modules. Also, with correlation networks, we can annotate all network nodes with respect to functional modules, so that the potential functions of certain genes or proteins in a given biological process can be identified. This can be accomplished by defining a fuzzy measure of module membership that generalizes the binary module membership indicator to a quantitative measure.
In conclusion, with the help of correlation network analysis, we can gain deeper insights into the biological regulation network and try to predict what is really happening inside cells.

Materials and Methods

In this article, we used the weighted correlation network analysis pipeline to investigate a gene co-expression network and tried to explain the regulatory relationships between different players in the gene regulation network.

First of all, we define a measure of similarity between gene expression profiles. This similarity measures the extent of concordance between gene expression levels over a period of time or across different experimental conditions, for example the expression profile of gene p53 during tumor pathogenesis or the expression levels of gene HuR under different concentrations of ATP. Specifically, for each pair of genes i and j, we denote this similarity by $s_{ij}$ and define it as the absolute value of the Pearson correlation, $s_{ij} = |\mathrm{cor}(i, j)|$. The Pearson correlation is calculated from an n × m matrix X where the
row indices (i = 1, 2, ..., n) correspond to network nodes and the column indices (l = 1, 2, ..., m) correspond to different sample measurements of the same node. Moreover, we denote the similarity matrix by $S = [s_{ij}]$.

Secondly, we transform the similarity matrix into an adjacency matrix. Since unweighted networks are unable to reflect the continuous nature of the underlying co-expression information, instead of applying hard thresholding, which results in an unweighted network, we choose a soft-thresholding strategy to generate the adjacency matrix of the weighted network. The weighted network adjacency is defined by raising the co-expression similarity to a power,

$$a_{ij} = s_{ij}^{\beta}, \qquad \beta \geq 1.$$

A suitable value of the parameter β can be chosen with the R function pickSoftThreshold. It is easily seen that the weighted adjacency $a_{ij}$ between two genes is proportional to their similarity on a logarithmic scale, $\log(a_{ij}) = \beta \times \log(s_{ij})$.

Thirdly, we use the topological overlap dissimilarity measure to identify functional modules, which consist of densely interconnected genes, without the use of a priori defined gene sets. The default method is hierarchical clustering with the standard R function hclust. Branches of the hierarchical clustering dendrogram, which correspond to modules, can be identified using one of a wide range of available branch cutting methods, including the constant-height cut and the Dynamic Tree Cut method. The topological overlap of two nodes reflects their relative interconnectedness, and the topological overlap matrix (TOM) $\Omega = [\omega_{ij}]$ provides a similarity measure (the opposite of a dissimilarity), which has been found useful in unweighted and weighted networks:

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}},$$

where $l_{ij} = \sum_u a_{iu} a_{uj}$ and $k_i = \sum_u a_{iu}$ is the node connectivity. To obtain a dissimilarity measure, we use the formula $d_{ij} = 1 - \omega_{ij}$ to define the topological overlap-based dissimilarity.

Once the gene modules have been determined, the next step is to relate them to external information.
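As a rough illustration of the similarity, adjacency and topological overlap steps described above, the following pure-Python sketch computes the three matrices for a tiny example. It is not the WGCNA implementation: the toy expression matrix X, the power beta = 6, the helper names and the convention of setting the adjacency diagonal to zero (so that sums over u skip u = i and u = j) are all assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def similarity(X):
    """s_ij = |cor(x_i, x_j)| for the rows (genes) of X."""
    n = len(X)
    return [[abs(pearson(X[i], X[j])) for j in range(n)] for i in range(n)]

def adjacency(S, beta):
    """Soft thresholding a_ij = s_ij ** beta; diagonal set to 0 (assumption)."""
    n = len(S)
    return [[0.0 if i == j else S[i][j] ** beta for j in range(n)]
            for i in range(n)]

def tom_dissimilarity(A):
    """d_ij = 1 - w_ij with w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(A)
    k = [sum(row) for row in A]  # node connectivity k_i = sum_u a_iu
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a node fully overlaps with itself, so d_ii = 0
            # shared-neighbour term; the zero diagonal removes u = i and u = j
            l_ij = sum(A[i][u] * A[u][j] for u in range(n))
            w_ij = (l_ij + A[i][j]) / (min(k[i], k[j]) + 1.0 - A[i][j])
            D[i][j] = 1.0 - w_ij
    return D

# Hypothetical toy data: 4 genes (rows) measured in 5 samples (columns).
X = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # gene 0
    [2.0, 4.1, 5.9, 8.2, 9.9],  # gene 1, almost proportional to gene 0
    [5.0, 3.0, 4.0, 1.0, 2.0],  # gene 2, negatively correlated with gene 0
    [1.0, 3.0, 2.0, 4.0, 5.0],  # gene 3
]
beta = 6  # assumed soft-thresholding power, for illustration only

S = similarity(X)
A = adjacency(S, beta)
D = tom_dissimilarity(A)

for row in D:
    print([round(d, 3) for d in row])
```

In this toy data, genes 0 and 1 are almost perfectly correlated, so their topological overlap dissimilarity comes out smaller than that of genes 0 and 2. A matrix such as D is the kind of input that a hierarchical clustering routine would then use to build the dendrogram.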
Based on the gene modules identified in the previous step, we can perform functional enrichment analysis to find out whether the genes in a module are enriched for particular cellular functions. Furthermore, we need to identify biologically or clinically significant modules and genes, which is a major goal of gene expression analyses. The definition of biological or clinical significance depends on the research question under consideration. Abstractly speaking, we define a gene significance measure as a function that assigns a non-negative number to each gene; the higher the value, the more biologically significant the gene. In gene knockout experiments, gene significance could indicate knockout essentiality, while a microarray sample trait T can be used to define a trait-based gene significance measure as the absolute correlation between the trait and the expression profiles. For a functional
module, a measure of module significance can be defined as the average gene significance across the module genes.

Next, studying the topological properties of biological networks is also of great importance. Many topological properties of networks can be succinctly described using network concepts, also known as network statistics, including whole network connectivity (degree), intramodular connectivity, topological overlap, the clustering coefficient, density and so on. Differential analysis of network concepts such as network connectivity may reveal potential regulatory changes in the expression of certain genes. The WGCNA package of R implements several functions for computing these network statistics, such as softConnectivity, intramodularConnectivity, TOMsimilarity, clusterCoef and networkConcepts. Basic R functions can be used to create summary statistics of these concepts and to test their differences across networks.

Results and Discussions

1. Data cleaning and preprocessing

In this article, we downloaded the gene expression data (microarray data of female liver cells and microarray data of male liver cells) from the online microarray database. These two data sets contain roughly 130 samples each. Note that each row corresponds to a gene and each column to a sample or other experimental information. We extracted the expression data from the raw file into a multi-set format suitable for consensus analysis. Because of the large amount of missing data, we used the R function goodSamplesGenesMS to filter out samples containing an excessive number of missing values. Moreover, we used Euclidean distance-based sample clustering to filter out outlying samples; one sample, F2_221, appeared to be an outlier in the female liver data. After this quality control, the two datasets were ready for further analysis.

Figure 1. Sample clustering result (panel title: Sample clustering on all genes in female liver).

We also downloaded the gene annotation file and clinical traits file so that we
could match this information to the expression data.

2. Network construction

Network construction is the most important step in correlation network analysis. Since we chose the one-step soft-thresholding strategy to generate the adjacency matrix of the network, the construction step entails the choice of the soft-thresholding power β to which the co-expression similarity is raised to calculate the adjacency. Because gene regulation networks are taken to follow a power-law distribution, we chose the soft-thresholding power based on the criterion of approximate scale-free topology. We therefore used the function pickSoftThreshold, which performs the analysis of network topology. Among the candidate powers from 1 to 15, the values 6, 7 and 8 appeared to be suitable. In order to speed up the calculation and fit the scale-free topology model better, we chose 7 as the soft-thresholding power.

Figure 2. Soft-thresholding power test (panels: scale-free topology model fit, mean connectivity, median connectivity and max connectivity, each plotted against the soft threshold power).

3. Functional module detection

Based on the previous results, we chose a soft-thresholding power of 7, a minimum module size of 30 and a module detection sensitivity (deepSplit) of 2. As for the merging parameters, we set the cut height for merging modules to 0.20, which means that modules whose gene expression profiles are correlated above 1 − 0.2 = 0.8 are merged. It can be seen that roughly 11 gene modules or gene clusters were identified based on the weighted correlation networks constructed from the gene expression data. In reality, 17 gene modules had been found, however