The gene set analysis [ 14 — 16 ] is supposed to be more powerful than a gene-by-gene analysis because it can detect a change of expression of a group of genes although none of them show a very high absolute fold change. Furthermore, a change of all genes in a given pathway may be biologically more meaningful than a large increase of a single gene. Also, provided that the gene sets are well defined, the result should be more sound and comparable across studies than a gene-by-gene analysis [ 14 ]. The analysis of longitudinal microarray experiments through a gene set approach is not trivial because the dynamics of gene expressions inside a gene set can be complex and heterogeneous.
This has already been underlined in some of the approaches developed to analyze gene sets [ 15 , 18 — 20 ]. Fig 1 shows an example of a homogeneous gene set, whereas Fig 2 shows an example of a heterogeneous one. Actually, such a heterogeneity is frequently observed [ 20 ], and cannot be ignored, as genes inside a functional gene set are not expected to change their expression synchronously.
Moreover this heterogeneity can be biologically meaningful by itself. Prieto et al. They identified heterogeneous gene sets linked to acute promyelocytic leukemia. Another example is given by Hu et al. The main advantage of detecting the heterogeneity inside a gene set is to detect any change over time whatever the specification of the model for the trends. In other words, the dynamics of gene expression inside a stable gene set will be summarized by a flat slope and no heterogeneity.
Hence, in the spirit of [ 19 ], to find any significant change of the overall expression of genes inside a gene set over time, we suggest to look for any significant trend over time or any heterogeneity between gene trends inside the gene set. Each line is the median expression of a gene inside this particular gene set across all the patients. The expression of the genes inside this gene set is quite homogeneous and it is easy to identify a global time trend, displayed by the dashed black line smoothed median.
The expression of the genes inside this gene set is rather heterogeneous. This makes it difficult to identify any time trend, as the mean expression inside this gene set stays close to zero. However a closer look reveals two distinct time trends, displayed by the two respectively dashed and dotted black lines smoothed medians. Several approaches have already been proposed to analyze longitudinal measurements of gene expression [ 7 , 23 — 28 ], but only a few include gene set analysis [ 10 , 18 , 29 — 31 ].
Among the latter, all but Nueda et al. Unfortunately, it does not account for the structure of longitudinal data, simply treating all observations as independent and calculating Pearson correlation of the genes with the time variable. Therefore this is quite a limited modeling strategy for longitudinal data.
In practice, the global null hypothesis tested is quite flexible relying on the ANOVA framework, but cannot accommodate missing values. Wang et al. They considered a random effect for the array level rather than for the patient or the gene level. Zhang et al. Wu et al. However, CAMERA accounts for neither potential heterogeneity inside a gene set nor for repeated measurement correlation.
Of note, it is not possible to look at the change of gene expression in only one group using Zhang et al. It is based on a Principal Component Analysis PCA of each gene set followed by a linear regression of the significant principal components i. However, they did not consider time-course experiments where repeated measures are available for multiple patient.
Gene set analysis methods can also be distinguished by their choice of the null hypothesis.
Time-Course Gene Set Analysis for Longitudinal Gene Expression Data
Those can be classified into two main types of hypothesis: i the competitive null hypothesis, that tests the genes inside a given gene set against all the other genes outside the gene set; ii the self-contained null hypothesis, that only uses the genes inside the gene set of interest [ 20 , 33 , 34 ]. According to Emmert-Streib et al. Self-contained gene set tests are especially appropriate in a hypothesis driven context where a priori defined gene sets are validated and relevant regarding the biological question. We propose the implementation of a hypothesis driven method that directly tests the time-course significance of predefined gene sets: the Time-course Gene Set Analysis TcGSA.
It relies on the use of linear mixed effect models, a very useful and well established statistical tool [ 36 , 37 ] especially suited for longitudinal settings. By using all available repeated measures, it provides increased statistical power. TcGSA can accommodate for heterogeneity of gene expression within the gene sets through random effects, and is robust to unbalanced designed due to missing at random values thanks to the maximum likelihood estimates.
No previously proposed approach combines all of TcGSA features. A simulation study demonstrated the good statistical performance of the proposed method. It has been applied to two studies: one HIV vaccine trial, and one influenza and pneumoccocal vaccine study [ 6 ], using the same definition of gene sets [ 13 ] that is increasingly used in systems immunology research [ 38 — 42 ]. Compared to gene-by-gene analyses, TcGSA disclosed changes of additional gene sets that endorse previous conclusions [ 6 ], but also revealed common pathways across the three vaccines. The TcGSA method includes three steps: 1 modeling gene expression in a gene set with mixed models, 2 testing the significance of a gene set, and 3 estimating individual gene profiles.
Let S be a gene set of interest.
The expression of genes inside S is modeled over time according to a function f as:. Finally f g t i is a function of time, that can be linear, polynomials, etc. In this paper we focus on three forms for f g but other forms, such as exponential, etc. Alternatively, one can make the assumption that the patient effect is the same for all the genes. In that case, the random effect c is no longer grouped on the gene level, and the model can be written as:. This alternative modeling has the advantage to be more parsimonious than the Eq 1 , with less parameters to be estimated. The expression of genes inside S is modeled over time according to a function f g , m that is now stratified on the groups:.
In other words, we want to test the significance of the time trend while being sensitive to both homogeneous and heterogeneous changes of gene expression over time inside a gene set. Testing the significance of a given gene set S therefore means testing both fixed and random effects at once, in a single test. A likelihood ratio test is the natural way to do so, fitting models under both the null hypothesis and the alternative. In the case of one group experiment Eqs 1 and 1bis the null hypothesis H 0 is that the genes inside S are stable over time, i.
The alternative hypothesis H 1 is that the genes inside S vary significantly over time: 1. In the case of a multiple group experiment Eq 2 , the null hypothesis is that inside the gene set S , the evolution of gene expressions over time is the same regardless of the group. The alternative hypothesis is that time trends f are different depending on the group m : 2. In both case, one model is fitted under the null hypothesis, and one is fitted under the alternative.
The likelihood ratio is then computed. However, since both fixed and random effects are tested simultaneously in this likelihood ratio, its null distribution is not the standard chi-square distribution because of boundary constraints due to the variance of random effects. According to Self et al. This approximation implies that the tested random effects are independent of one another [ 45 — 47 ]. This seems an acceptable assumption according to our simulations under the null see S1 and S2 Figs. This allows to compute a p-value for the significance of the variation of a given gene set over time.
When several gene sets are investigated at a time, it is necessary to take into account multiple testing. A number of procedures are available to do so [ 48 ].
As the TcGSA is mostly an exploratory analysis even though hypothesis driven in the sense that gene set are defined a priori , we recommend using the Benjamini-Yekutieli procedure for controlling the False Discovery Rate [ 49 ], as gene sets are necessarily correlated between each others and this procedure is robust to some of these dependances.
In the estimation of linear mixed model, it is common to use the Restricted Maximum Likelihood REML instead of the classic Maximum Likelihood ML in order to avoid biased estimates of the variance components [ 50 ]. But note that REML cannot be used to estimate the likelihood ratios presented here. Indeed, REML estimation of the likelihood ratio between two models can only be used when both models have the same fixed part [ 51 ].
- A View From The Heart: Poems!
- Systems approach to engineering design?
- Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression?
For the inference of the random effects, Best Linear Unbiased Predictor BLUP are used [ 52 ], giving access to estimations of a single profile for each gene among a gene set, in each patient. As a result, the estimations from the mixed model are shrunken towards the average expression inside the gene set. This shrinkage occurs when the residuals variability is relatively large compared to the the random effects estimated variances [ 52 ].
The mixed model uses the repeated pattern of the longitudinal measurements to structure the variation. Its estimations give smoother trajectories for the genes than the raw data, which makes the general evolution of the set clearer [ 53 ], as it can be seen in Figs 3 and 4. Each line is the median over the patients of the expression of one gene. Each graph shows all the genes in one particular gene set. The left graph displays the raw gene expression, the right one displays the estimations from the mixed model for the same gene set.
Comprehensive and Systematic Analysis of Gene Expression Patterns Associated with Body Mass Index
The expressions have been centered and reduced for this representation. If several dynamics are identified by the gap statistics among the estimated expressions inside one gene set, they are displayed in different colors—such as for the gene sets M 4. Once a gene set S has been identified as significant through the previous mixed likelihood ratio statistics, a summary of its dynamic over time is needed. However, due to the possible heterogeneity of S , giving a summary representation of S dynamic is not obvious. We propose to automatically identify the number of trends in a significant gene set from the fit of the model.
Predicted gene expressions from the linear mixed model are clustered, and the optimal number of trends is selected with the gap statistic [ 54 ]. It is a formalization of the elbow criterion for the within-cluster variance.
In order to determine the optimal partition of each gene set here, the gap statistics is applied onto a hierarchical clustering of gene expressions inside each gene set. Then the median within each of the identified clusters can summarize each trend. Therefore, gene sets are actually split when heterogeneous, before being summarized.
The predicted gene expression from the linear mixed model is used for this and not the observed expression because smoothness of trajectories facilitates classification [ 53 ]. Examples of such representations are given in Figs 3 and 4. Most often, TcGSA will be used to investigate a large number of gene sets from a few hundreds to a few thousands. This multiplicity can make visualization of the results more challenging, in addition of requiring a multiple testing correction.
TcGSA is designed to identify gene sets that shows a simultaneous evolution of gene expression, but possibly of a small intensity. The method can therefore be quite sensitive, and it can be of interest to rank the significant gene sets to identify the most acute signals.