Fudan University: "Introduction to Network Science" student course project paper_Weighted Correlation Network Analysis of Biological Data (biological network analysis).pdf

Correlation Network Analysis of Biological Data

Zhaolong Yu
Fudan University, Shanghai, 200433

Abstract

Correlation network analysis is widely used to find clusters, or modules, in complex networks, especially biological networks and stock networks. Based on correlations between quantitative measurements, weighted correlation network analysis identifies modules formed by highly correlated elements, such as genes or proteins in biological networks. This method allows us to explore the system-level functionality of particular genes. In this article, we use weighted correlation network analysis to investigate gene co-expression networks in the context of the transcriptional response of cells to changing conditions.

Introduction

Networks provide a straightforward representation of the interactions between different elements in a system, which enables us to gain insight into the dynamics of complex systems under various conditions. In the past ten years, network-based methods have proved useful in many domains, including the analysis of social, physical and biological systems. For example, in social networks, network-based methods can help us predict potential links between two people, detect highly connected communities and identify the most influential "superstars". As for biological networks, the rapid development of biomedical science has led to the identification of more and more of them, such as gene co-expression networks, protein-protein interaction networks and cell-cell interaction networks.

Previous research on biological processes fell short of accurately quantifying biological molecules and tracking biological reactions from a systematic point of view. Simply measuring the expression of a few genes and investigating the molecular mechanisms in one or two pathways does not necessarily help explain complex biological processes, during which thousands of reactions proceed simultaneously. Given that every cell contains a gene regulation network in which around 20,000 genes and millions of RNAs and proteins interact with each other and reach a balance, network analyses make it possible to take different kinds of biological components into consideration and to probe the underlying mechanisms of gene expression and regulation. In many real networks, the probability $p(k)$ that a node is connected to $k$ other nodes decays as a power law. Many biological networks follow the same structure

where the topology is dominated by a few highly connected nodes (hubs) that link the rest of the less connected nodes. For example, analysis of the protein-protein interaction network revealed that highly connected nodes are more likely to be essential for survival, namely housekeeping genes or proteins.

To better understand biological networks, one of the most important tasks is to work out the relationships between different components inside the cell. Correlation network analysis is an effective method for measuring such relationships and detecting functional clusters. Correlation networks are constructed from correlations between quantitative measurements that can be described by an $n \times m$ matrix $X$, where the row indices ($i = 1, 2, 3, \ldots, n$) correspond to network nodes and the column indices ($l = 1, 2, 3, \ldots, m$) correspond to sample measurements.

The primary rationale behind correlation network methodology is to use network language to find clusters (modules) of interconnected nodes, that is, sets of nodes closely connected according to a suitably defined measure of interconnectedness (correlation). A second use of correlation networks is to identify significant modules among all the modules computed by the analysis pipeline. Given a node significance measure, modules with high average node significance are identified as significant modules. In addition, correlation networks let us annotate all network nodes with respect to functional modules, so that the potential functions of particular genes or proteins in a given biological process can be identified. This can be accomplished by defining a fuzzy measure of module membership that generalizes the binary module membership indicator to a quantitative measure. In conclusion, correlation network analysis lets us gain deeper insight into the biological regulation network and try to predict what is actually happening inside the cell.

Materials and Methods

In this article, we used a weighted correlation network analysis pipeline to investigate a gene co-expression network and tried to explain the regulatory relationships between different players in the gene regulation network.

First, we define a measure of similarity between gene expression profiles. This similarity measures the extent of concordance between gene expression levels over a period of time or across different experimental conditions, for example the expression profile of the gene p53 during tumor pathogenesis or the expression levels of the gene HuR under different concentrations of ATP. Specifically, for each pair of genes $i$ and $j$, we denote this similarity measure by $s_{ij}$ and define it as the absolute value of the Pearson correlation, $s_{ij} = |\mathrm{cor}(i, j)|$. This Pearson correlation is calculated from an $n \times m$ matrix $X$ where the

row indices correspond to network nodes ($i = 1, 2, 3, \ldots, n$) and the column indices ($l = 1, 2, 3, \ldots, m$) correspond to different sample measurements of the same node. We denote the similarity matrix by $S = [s_{ij}]$.

Secondly, we transform the similarity matrix into an adjacency matrix. Because unweighted networks cannot reflect the continuous nature of the underlying co-expression information, we do not apply hard thresholding, which would yield an unweighted network. Instead, we choose a soft-thresholding strategy to generate the adjacency matrix of the weighted network. The weighted network adjacency is defined by raising the co-expression similarity to a power, $a_{ij} = s_{ij}^{\beta}$, with $\beta \geq 1$. The parameter $\beta$ is chosen with the R function pickSoftThreshold. It is easy to see that the weighted adjacency $a_{ij}$ between two genes is proportional to their similarity on a logarithmic scale, $\log(a_{ij}) = \beta \times \log(s_{ij})$.

Thirdly, we use the topological overlap dissimilarity measure to identify functional modules, which consist of densely interconnected genes, without using a priori defined gene sets. The default method is hierarchical clustering with the standard R function hclust. Modules, which correspond to branches of the hierarchical clustering dendrogram, can be identified using one of several available branch-cutting methods, including the constant-height cut and the Dynamic Tree Cut method. The topological overlap of two nodes reflects their relative interconnectedness, and the topological overlap matrix (TOM) $\Omega = [\omega_{ij}]$ provides a similarity measure (the opposite of a dissimilarity) that has been found useful in both unweighted and weighted networks:

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}},$$

where $l_{ij} = \sum_u a_{iu} a_{uj}$ and $k_i$ is the node connectivity. To obtain a dissimilarity measure, we use the formula $d_{ij}^{\omega} = 1 - \omega_{ij}$ to define the topological overlap-based dissimilarity. Once the gene modules have been determined, the next task is to relate them to external information.
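The three steps above (similarity, soft-thresholded adjacency, topological overlap dissimilarity) can be illustrated with a minimal plain-Python sketch. It is not the WGCNA implementation used in this article. The function names, the toy expression matrix and the value $\beta = 2$ are assumptions chosen for illustration, and the sums defining $l_{ij}$ and $k_i$ are assumed to skip the nodes $i$ and $j$ themselves.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def similarity(X):
    """s_ij = |cor(i, j)| between the rows (genes) of the n x m matrix X."""
    n = len(X)
    return [[abs(pearson(X[i], X[j])) for j in range(n)] for i in range(n)]


def adjacency(S, beta):
    """Soft thresholding: a_ij = s_ij ** beta, with beta >= 1."""
    return [[s ** beta for s in row] for row in S]


def tom_dissimilarity(A):
    """d_ij = 1 - w_ij, with w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij).

    Assumption: l_ij sums over u not in {i, j}, k_i sums over u != i,
    and the diagonal dissimilarity is set to 0.
    """
    n = len(A)
    k = [sum(A[i][u] for u in range(n) if u != i) for i in range(n)]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            l_ij = sum(A[i][u] * A[u][j] for u in range(n) if u not in (i, j))
            w_ij = (l_ij + A[i][j]) / (min(k[i], k[j]) + 1 - A[i][j])
            D[i][j] = 1.0 - w_ij
    return D


# Toy data (hypothetical): 3 genes measured in 4 samples.
X = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
S = similarity(X)
A = adjacency(S, beta=2)  # beta = 2 is arbitrary here, not a tuned value
D = tom_dissimilarity(A)
for row in D:
    print([round(d, 4) for d in row])
```

In this toy example genes 0 and 1 are perfectly correlated, so their dissimilarity is smaller than that between gene 0 and gene 2. A matrix such as `D` is what a hierarchical clustering routine would take as input for module detection.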
Based on the gene modules generated in the preceding step, we can perform functional enrichment analysis to find out whether the genes in each module are enriched for particular cellular functions. Furthermore, we need to identify biologically or clinically significant modules and genes, which is a major goal of gene expression analyses. The definition of biological or clinical significance depends on the research question under consideration. Abstractly speaking, we define a gene significance measure as a function that assigns a non-negative number to each gene; the higher the value, the more biologically significant the gene. In gene knockout experiments, gene significance could indicate knockout essentiality, while a microarray sample trait $T$ can be used to define a trait-based gene significance measure as the absolute correlation between the trait and the expression profiles. For a functional

module, a measure of module significance can be defined as the average gene significance across the module genes.

Next, studying the topological properties of biological networks is also of great importance. Many topological properties of networks can be succinctly described using network concepts, also known as network statistics, including whole-network connectivity (degree), intramodular connectivity, topological overlap, the clustering coefficient, density and so on. Differential analysis of network concepts such as network connectivity may reveal potential regulatory changes in the expression of certain genes. The WGCNA package for R implements several functions for computing these network statistics, such as softConnectivity, intramodularConnectivity, TOMsimilarity, clusterCoef and networkConcepts. Basic R functions can be used to create summary statistics of these concepts and to test their differences across networks.

Results and Discussion

1. Data cleaning and preprocessing

In this article, we downloaded the gene expression data (microarray data of female liver cells and microarray data of male liver cells) from the online microarray database. These two data sets contain roughly 130 samples each. Note that each row corresponds to a gene and each column to a sample or other experimental information. We extracted the expression data from the raw file into a multi-set format suitable for consensus analysis. Because of the large amount of missing data, we used the R function goodSamplesGenesMS to filter out samples containing an excessive number of missing values. Moreover, we used Euclidean distance-based sample clustering to filter out outlier samples; one sample, F2_221, appeared to be an outlier in the female liver data.
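The missing-data filter described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the goodSamplesGenesMS implementation: it only removes samples from a single data set, and the function name `filter_samples`, the use of `None` for missing values and the 50% threshold are assumptions.

```python
def filter_samples(expr, max_missing_fraction=0.5):
    """Drop samples (columns) with too many missing values.

    expr: list of gene rows, each a list of per-sample values;
          None marks a missing measurement (an assumption of this sketch).
    Returns the filtered matrix and the indices of the samples that were kept.
    """
    n_genes = len(expr)
    n_samples = len(expr[0])
    keep = []
    for s in range(n_samples):
        missing = sum(1 for g in range(n_genes) if expr[g][s] is None)
        if missing / n_genes <= max_missing_fraction:
            keep.append(s)
    filtered = [[row[s] for s in keep] for row in expr]
    return filtered, keep


# Toy data (hypothetical): 4 genes (rows) x 3 samples (columns).
expr = [
    [1.0, None, 2.0],
    [0.5, None, None],
    [1.2, 0.3, 0.9],
    [0.8, None, 1.1],
]
filtered, kept = filter_samples(expr)
print(kept)  # sample 1 is dropped: 3 of its 4 values are missing
```

The same idea extends to filtering genes (rows) by swapping the roles of rows and columns.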
After this quality control, the two datasets were ready for further analysis.

Figure 1. Sample clustering result (panel title: Sample clustering on all genes in female liver).

We also downloaded the gene annotation file and clinical traits file so that we