Top Links
Journal of Computational Systems Biology
ISSN: 2455-7625
Modeling of Gene Regulatory Networks: A Literature Review
Copyright: © 2014 Fadhl M Alakwaa. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Related article at Pubmed, Google Scholar
In the last years numerous methods have been developed and applied to reconstruct the structure and dynamic rules of gene-regulatory networks from different high-throughput data sources such as gene expression data. In this paper we summarized four promising modeling approaches to obtain a better understanding of their relative strengths and weaknesses which, in turn, can have a profound effect on developing techniques for drug testing and therapeutic intervention for effective treatment of human diseases and help the systems biology community.
Keywords: Gene Expression Data; Bayesian Network; Gene Regulatory Network; System Biology; Functional genomicsSelection
As basic building blocks of life, genes, as well as their products (proteins), do not work independently. Instead, in order for a cell to function properly, they interact with each other and form a complicated network. Gene networks represent the relationship between sets of genes that coordinate to achieve different tasks. A variety of clustering algorithms have been used to group together similar temporal expression patterns and thus reveal clusters of genes that seem to be co-regulated in experiments (see [1] for a review of clustering techniques). Genome-wide clustering of gene expression profiles provides an important first step towards this goal by grouping together genes that exhibit similar transcriptional responses to various cellular conditions, and are therefore likely to be involved in similar cellular processes. However, the organization of genes into co- regulated clusters provides a very coarse representation of the cellular network. In particular, it cannot separate statistical interactions that are irreducible (i.e., direct) from those arising from cascades of transcriptional interactions that correlate the expression of many non-interacting genes. More generally, as appreciated in statistical physics, long range order (i.e., high correlation among non-directly interacting variables) can easily result from short range interactions [2]. Thus correlations, or any other local dependency measure, cannot be used as the only tool for the reconstruction of interaction networks without additional assumptions [3]. Within the last few years a number of sophisticated approaches for the reverse engineering of cellular networks (also called deconvolution) from gene expression data have emerged. Their goal is to produce Gene gene interaction networks which can be used to discover new medicine and understand the ontology of disease.
As the diversity of reverse engineering methods, we will cover four sophisticated promising modeling approaches. The aim of this paper is to obtain a better understanding of approached strengths and weaknesses on the systems biology community. The rest of the paper is organized as following first we describe models selection criteria which were studied in this paper, Second we cover the description of each model specifications, Third we identified the data source, its requirements and data discretization, fourth we summarized some of GRN construction challenges, fifth we described in brief each modeling schema, sixth we define how the importance of synthetic networks to assess the performance of GRN models then we describe DREAM project where researches meet and discuss their reverse engineering approach, and finally we open some questions to discussion.
We need to say that these models without real and clear problem statement are like computer games simulation. i.e you should to fit your problem with one of these models; not all models work perfectly. For example If we are not interested in predicting the exact concentrations of different substances, but only in the patterns of the systems behavior such as steady states, we can often use simplified Boolean- type networks instead of differential equations [4].
We choose four modeling approach which are Bayesian Network (BYN), Boolean Network (BNN), Non Linear Ordinary Differential Equation (NLODE), and Association Networks (AN), based on their promising results, to they extend on the community and the availability of its implementation which makes an easy for the reader to test each approach on synthetic or read data set without involving in the implementation complicity. The available software for each method are Bayesnet Toolbox [5], Probabilistic Boolean Network [6], Genetic Network Analyzer (GNA) [7] and ARACNE [8] respectively. Table 1 shows the comparison between above GRN modeling approaches as described in the following sections:
Ivan Ivanov and Edwad R Dougherty [9] compare fine-scale stochastic-differential equation models with coarse-scale discrete models in the context of currently available data and with respect to their description of switch-like behavior among specific groups of genes and find that a discrete model has predictive power comparable to that of the stochastic differential equation model under the assumption of complete knowledge of the parameters of the fine-scale model.For more details see data discretization section.
In deterministic models we assume that the next state of the system is determined by the current state and the external inputs. However, in real world systems stochastic effects may play an important role. For instance, for some genes in yeast the number of mRNA molecules is close to one copy per cell [10]. This means that it is likely that there is a considerable intrinsic noise element present – some cells apparently have more mRNA molecules of the given species present than others. Thus modeling a cell by using continuous concentrations effectively means modeling an ensemble of cells by mean values of stochastic variables. Simulating a stochastic model is computationally more expensive, because the simulations have to be run several times to provide a good impression of the system behavior. But stochastic models are not always necessary; it depends on the system that is to be modeled. If the number of molecules involved is small and if important processes depend on random effects, stochastic models might be the best choice.
Gene network inference techniques are data-hungry [11]. Time series and steady-state data are the available gene expression data. Time series has the advantage of being able to identify causal relations, i.e. gene-regulatory relations, between genes without the need of actively perturbing the system. Spellman et al. generated time series data under different culture conditions and using different mutant backgrounds in order to reveal a more comprehensive picture about gene regulation during the yeast cell cycle [12].
Table 2 shows how many data points do we really need to infer a gene network on N genes depends on the model used to do the inference [11].
The need for large numbers of data points, and many different conditions, implies that successful modeling efforts will probably have to use data from different sources like from different high-throughput data sources, mainly microarray based gene expression analysis, promotor sequence information, chromatin immunoprecipitation (ChIP) and protein-protein interaction assays [11].
Florian Geier et al. [13] studied the requirements on data size and data quality that must be met by a successful network reconstruction and we could summarize their findings as the following:
• Short time series generated under transcription factor knock-out are optimal experiments in order to reveal the structure of gene regulatory networks.
• The benefit of using of prior knowledge within a Bayesian learning framework is found to be limited to conditions of small gene expression data size.
• The results suggest that discretization of the continuous data leads to a large information loss.
• Results indicate, that network reconstruction with currently available data will still give rise to many false predictions (FDR ~ 50%).
Reconstructing regulatory networks from gene expression profiles is a challenging problem of functional genomics. In microarray studies the number of samples is often very limited compared to the number of genes, thus the use of discrete data may help reducing the probability of finding random associations between genes. On other hand the results was found by Florian [13] suggest that discretization of the continuous data leads to a large information loss. Barbara Di Camillo et al. [14] confirms that the use of discrete rather than continuous data is advantageous when few samples are available. Continuous approaches are likely to become advantageous with increasing number of samples. There are many discretization methods the simple one is to map gene expression to 0 and 1 by setting an appropriate threshold. See [14] for more details.
Stark J et al. [15] summarized the challenges that a reconstruction of gene-regulatory networks from time series of gene expression data is facing.
• The quality of data derived from high-throughput gene expression experiments is largely limited by noise. For example the typical magnitude of observational noise in microarray measurements is about 20–30% of the signal. In high-throughput techniques dynamical noise maybe expected to play a minor role due to the underlying population sampling of the data. In contrast, data derived from gene expression at the single cell level can exhibit a significant amount of dynamical noise as well as strong cell to cell variations.
• Data size, i.e. length of a time series and number of replicates, is limited by the cost of experiments. The typical length of time series measurements in microarray studies is around 10–20 time points and 3–5 replicates. Therefore, any model underlying network reconstruction methods must be simple, i.e. contain as few parameters as possible, and robust.
• Gene regulation is due to the activity of transcription factors (TFs) which is in most cases post-translationally controlled by additional factors. This activity is not directly observed by measuring TF expression levels.However, many network reconstruction methods based on time series assume the activity of TFs to be directly related with their expression levels, there by omitting additional hidden variables.
Accounting for hidden variables in the framework of network reconstruction methods based on time series demands more data in order to estimate the additional parameters and can complicate a biological interpretation of the hidden variables.
While a variety of computational methods have been considered for reconstructing gene networks from observational gene expression data, Bayesian network (BN) based approaches have shown great promise to infer causal relationships between genes and receive increasing attention. One of the first seminal papers promoting this approach aimed to learn gene regulatory networks in Saccharomyces cerevisiae from gene expression profiles with Bayesian networks [16].
BN are especially suitable for learning genetic regulatory networks for the following reasons: (1) the sound probabilistic semantics allows BNs to deal with the noises that are inherent in experimental measurements; (2) BNs can handle missing data and permit the incomplete knowledge about the biological system and (3) BNs are capable of integrating prior biological knowledge into the system.
Generally, a BN is a graphical representation of the dependence structure between multiple interacting quantities. This graphical representation is more commonly called a directed acyclic graph (DAG). The nodes or the vertices of the DAG represent the random variables in the network while the edges connecting the vertices represent the causal influence of one node on the other. To learn the structure of a network that describes the causal relationship among variables, one needs to have (1) a scoring function that assesses how well a network fits the data and (2) a search method to find networks with high scores.
A BN is a DAG that represents a joint probability distribution of a set of random variables {X_{1}, X_{2}, ..., X_{n}}. The nodes of the DAG are represented by this set of random variables. Let Pa_{i} denote the set of parents of X_{i} in the DAG. The DAG encodes the Markov relation which states the following: each variable X_{i} in the DAG is independent of its non-descendants given its set of parents in the DAG. Mathematically the joint distribution can be decomposed into a product form as
This is referred to as the chain rule for BNs. Learning a BN structure is to find a DAG that best matches the dataset. The common method of structure learning is to define a scoring function that evaluates how well the DAG explains the data and then to search for the best DAG that optimizes the scoring function. A commonly used scoring function for discrete data is called BDe scoring metric which computes the posterior probability of a network for the given data [17]. As the number of possible graph structures is super-exponential in 'n' (the number of nodes in the network), searching exhaustively over the space of the DAGs is computationally very challenging. Thus, heuristic methods are typically used. For more details about these methods the reader could see [18].
The simplest dynamic models – synchronous Boolean network models – were used as a model for gene regulatory networks already in the 1960's by Stuart Kauffman as [19]. Boolean networks are based on the assumption that binary on/off switches functioning in discrete time steps can describe important aspects of gene regulation. In synchronous Boolean network models all genes switch states simultaneously (Figure 1). We can introduce the concept of the state of the network defined as an n-tuple of 0s and 1s describing which genes in the network are or are not expressed at the particular moment (Figure 1). As time progresses, the network navigates through the 'state space', switching from one state to another, as shown in Figure 1 D. For a network of n genes, in total there are 2^{n} possible different states, for instance, for a three gene network the possible states are (0,0,0), (0,0,1), ..., (1,1,1). We can follow the succession of states with time and study which states are reached. Some states might never be reached. It is possible to look for attractors: these are states or series of states that once reached will not be left anymore. The small example network in Figure 1 has two attractors: one attractor is a single state (0,0,1), and the second attractor consists of two alternating states (1,0,1) and (0,1,0). This approach has been generalized in a number of ways. Randomly generated networks are used to study the dynamics of complex systems [21]. Stochastic extensions to deterministic Boolean networks were proposed so-called noisy networks by Akutsu et al. [22] and Probabilistic Boolean Networks by Shmulevich et al. [23].
Nonlinear ordinary differential equations are probably the most-widespread formalism for modeling genetic regulatory
networks. They represent the concentration of gene products mRNAs or protein by continuous, time-dependent variables, that is, x (t), t Є T, T being a closed time interval (T Є R≥0). The variables take their values from the set of nonnegative
real numbers (x: T → R≥0), reflecting the constraint that a concentration cannot be negative. In order to model the regulatory interactions between genes, functional or differential relations are used.
More precisely, gene regulation is modeled by a system of ordinary differential equations having the following form:
where x = [x1, . . . , xn]' is the vector of concentration variables of the system, and the function f = [f1, . . . , fn]', usually highly nonlinear, represents the regulatory interactions. The above system does not include the delays resulting from the time it takes to complete transcription, translation, and the other stages of the synthesis and the transport of proteins you can see [24] for more details.
The above definitions can be illustrated by means of a simple network in Figure 2. Each of the genes encodes a regulatory protein that inhibits the expression of the other gene, by binding to a site overlapping the promoter of the gene.
An ordinary differential equation model of the network in Figure 2 is shown in Figure 3. The variables x_{a} and x_{b} represent the concentration of proteins A and B, encoded by genes a and b, respectively. The temporal derivative of x_{a} is the difference between the synthesis term κah-(xb, θb,mb) and the degradation term γ_{a}x_{a}. The first term expresses that the rate of synthesis of protein A depends on the concentration of protein B and is described by the function h-. This so called Hill function is monotonically decreasing. It takes the value 1 for x_{b} =0, and asymptotically reaches 0 for x_{b} →∞. It is characterized by a threshold parameter θb and a cooperativity parameter mb (Figure 3.5b). For mb > 1, the Hill function has a sigmoidal form that is often observed experimentally [26]. The synthesis term κah-(xb, θb,mb) thus means that, for low concentrations of protein B, gene a is expressed at a rate close to its maximum rate κa (κa > 0), whereas for high concentrations of B, the expression of the gene is almost completely repressed. The second term of the differential equation, the degradation term, expresses that the degradation rate of protein A is proportional to its own concentration x_{a}, γ_{a} being a degradation parameter (γ_{a} > 0). Unfortunately, they are difficult to treat mathematically for networks comprising more than two genes, in which case we have to take recourse to numerical simulation. However, the application of numerical techniques is often difficult in practice, due to the absence of numerical values for the parameters in the model. A possible alternative is the use of linear ordinary differential equations. Powerful techniques for solving these equations exist, as well as techniques for estimating parameter values from experimental data.
If two genes show similar expression profiles, they are supposed to follow the same regulatory regime. To put it more pointedly: coexpression hints at coregulation. Coexpression networks (also known as relevance networks) are constructed by computing a similarity score for each pair of genes. If similarity is above a certain threshold, the gene pair gets connected in the graph, if not, it remains unconnected. Wolfe et al. [27] argue that networks of coexpressed genes provide a widely applicable framework for assigning gene function.
They show that coexpression agrees well with functional similarity as it is encoded in the Gene Ontology [28]. The first critical point in building a coexpression network is how to formalize the notion of similarity of expression profiles. Several measures have been proposed, the most simple of which is correlation. In a Gaussian model, zero correlation corresponds to statistical independence. The second critical step in building coexpression networks is assessing the significance of results. Many pairs of genes show similar behavior in expression profiles by chance even though they are not biologically related. Even high similarity of expression tells us little about the underlying biological mechanisms. Coexpression networks include regulatory relationships, but we cannot distinguish direct from indirect dependencies based on the similarity of expression patterns. Figure 4 exemplifies this problem on a small set of three highly coexpressed genes, which form a clique (a completely connected subgraph) in a coexpression network. The figure shows that several regulatory mechanisms can explain this observation, and from coexpression data alone we have no way of choosing between them. There are two possible solutions. Functional genomics has a long tradition of perturbing the natural state of a cell and inferring a gene's function from the observed effects. These interventions allow us to distinguish between the three models in Figure 4, because each model results in different predictions of effects, which can be compared to thoseobtained in experiments. For example, perturbing gene Y in the cascade X → Y → Z will only have an effect on gene Z but none on gene X. In the case where Y regulates both X and Z, perturbing it will result in changes at both regulatees. In the last case, where all three genes are regulated by a hidden regulator, perturbing one of them will not lead to changes at the other two. In the absence of perturbation data statistical methods may be used to find which of the possibilities is most likely. The theoretical background is the concept of conditional independence. Please see [2] for more details.
A powerful approach to test our understanding of gene regulatory networks is to build new networks from scratch in an approach called synthetic biology. Then we could compare model predictions with networks output. This approach allows us to investigate in depth the effect of noise, data size and hidden variables in the form of unobserved processes on the reconstruction of gene regulatory networks [29].
DREAM is a Dialogue for Reverse Engineering Assessments and Methods. Its main objective is to catalyze the interaction between experiment and theory in the area of cellular network inference. The fundamental question for DREAM is simple: How can researchers assess how well they are describing the networks of interacting molecules that underlie biological systems? The answer is not so simple. Researchers have used a variety of algorithms to deduce the structure of very different biological and artificial networks, and evaluated their success using various metrics. What is still needed, and what DREAM aims to achieve, is a fair comparison of the strengths and weaknesses of the methods and a clear sense of the reliability of the network models they produce. The reader could refer to the recent previous DREAM conference meeting to look for new reverse engineering approaches [30]. The purpose of DREAM is not to produce the best possible network, but to evaluate the best tools for producing networks. The choice of tools depends in part on the nature of the available data. The uploaded results with DREAM2 challenge show that the networks inferred from the data differed significantly from the real network, which is precisely known. What is not known is whether the data given are, by themselves, sufficient to distinguish the networks.
An interesting blinded competition on DREAM3 2008 assess the ability of scientists and their computer servants to infer networks from experimental data, by comparing their predictions to "gold-standard" networks whose structure is thought to be known. Predictors could know their ranking online [30].
At the basis of any modelling, including network modelling, there is a realisation and acceptance that a model describes only some properties of the 'real world' system, and ignores others. Thus it emphasizes particular aspects of reality, leaving out details that are not relevant for the purpose of the study. How far are we from being able to build realistic cell models? The availability of large-scale data sets such as microarray gene expression and genomic localisation data triggered the search for suitable approaches to model complex biological systems. By prediction gene network just from gene expression data we were ignoring the last 30 years of molecular biology literature in the design of the network. The question is how to make predictions in addition of what is known. We need also to standardize the methods of comparing gene network models. What is not known is whether the data given are, by themselves, sufficient to distinguish the networks. Finally from the Gene Ontology project the function of about one third of all genes is still unknown for the yeast Saccharomyces cerevisiae despite it being one of the best-studied organisms. And even for many of the better-known genes and core processes that have been studied for decades, like the cell cycle,there is still not enough data available to exactly know all changes in concentration and activation patterns. Currently it seems not feasible to simulate even relatively simple cells like yeast. Mechanisms like RNA interference, regulated degradation of mRNAs and proteins, chemical modifications of key molecules and others might play a larger role than anticipated in current models, other processes might still be unknown. It is obvious that the separation into gene regulatory networks, metabolic networks and protein interaction networks is possible only up to a certain degree. To what extent can the transcription regulation networks be decoupled from other networks, such as signal transduction networks? We need to integrate many types of information if we want to build realistic dynamic models; however, for current modelling approaches we have to limit the complexity of the systems we are dealing with.
Figure 1: Example for a small Boolean network consisting of 3 genes X, Y, Z. There are different ways for representing the network: A as a graph, B Boolean rules for state transitions, C a complete table of all possible states before and after transition, or D as a graph representing the state transitions. Reproduced from [20]. |
Figure 2: Example of a simple genetic regulatory network, composed of two genes a and b, the proteins A and B, and their regulatory interactions. The notation is inspired by the graphical conventions proposed by Kohn [25]. |
Figure 3: (a) Nonlinear ordinary differential equation model of the mutual- inhibition network (Figure 1). The variables x_{a} and x_{b} correspond to the concentrations of proteins A and B, respectively, parameters κ_{a} and κ_{b} to the synthesis rates of the proteins, parameters γ_{a} and γ_{b} to the degradation constants, parameters θ_{a} and θ_{b} to the threshold concentrations, and parameters m_{a} and m_{b} to the degree of cooperativity of the interactions. All parameters are positive. (b) Graphical representation of the characteristic sigmoidal form, form > 1, of the Hill function h^{-}(x, θ, m) |
Figure 4: Different mechanisms can explain coexpression. The left plot in the dashed box shows three coexpressed genes forming a clique in the coexpression graph. The other three plots show possible regulatory relationships that can explain coexpression: The genes could be regulated in a cascade (left), or one regulates both others (middle), or there is a common "hidden" regulator (right), which is not part of the model. Reproduce from [2]. |
Static(s)/dynamic(d) | Discreet(d)/Continuous(c) | Deterministic(d)/Stochastic(s) | Qualitative(ql)quantitative(qn) | |
---|---|---|---|---|
BYN | s | d,c | s | qn |
BNN | d | d | d | ql |
NLDE | d | c | d | qn |
AN | s | c | d | qn |
Table 1: Comparison between GRN modeling approaches |
Model | Data needed |
---|---|
Boolean, fully connected | 2N |
Boolean, connectivity K | 2Klog(N) ? |
Boolean, connectivity K, lin. sep. | Klog(N/K) |
Continuous, fully connected, additive | N |
Continuous, connectivity K, additive | Klog(N/K) ? |
Pairwise correlation | log(N) |
Table 2: Data size to recover gene networks with N genes and Connectivity K: at most K regulatory inputs per gene.Fully connected: each gene can receive regulatory inputs from all other genes; lin. Sep: linearly separable, for Boolean functions; additive: regulation can be modeled as a weighted sum. |