A Linear Mixed Model Spline Framework for Analysing Time Course ‘Omics’ Data

Jasmin Straube; Alain-Dominique Gorse; PROOF Centre of Excellence Team; Bevan Emma Huang; Kim-Anh Lê Cao

doi:10.1371/journal.pone.0134540

Abstract

Time course ‘omics’ experiments are becoming increasingly important to study system-wide dynamic regulation. Despite their high information content, analysis remains challenging. ‘Omics’ technologies capture quantitative measurements on tens of thousands of molecules. Therefore, in a time course ‘omics’ experiment molecules are measured for multiple subjects over multiple time points. This results in a large, high-dimensional dataset, which requires computationally efficient approaches for statistical analysis. Moreover, methods need to be able to handle missing values and various levels of noise. We present a novel, robust and powerful framework to analyze time course ‘omics’ data that consists of three stages: quality assessment and filtering, profile modelling, and analysis. The first step consists of removing molecules for which expression or abundance is highly variable over time. The second step models each molecular expression profile in a linear mixed model framework which takes into account subject-specific variability. The best model is selected through a serial model selection approach and results in dimension reduction of the time course data. The final step includes two types of analysis of the modelled trajectories, namely, clustering analysis to identify groups of correlated profiles over time, and differential expression analysis to identify profiles which differ over time and/or between treatment groups. Through simulation studies we demonstrate the high sensitivity and specificity of our approach for differential expression analysis. We then illustrate how our framework can bring novel insights on two time course ‘omics’ studies in breast cancer and kidney rejection. The methods are publicly available, implemented in the R CRAN package lmms.

Citation: Straube J, Gorse A-D, PROOF Centre of Excellence Team, Huang BE, Lê Cao K-A (2015) A Linear Mixed Model Spline Framework for Analysing Time Course ‘Omics’ Data. PLoS ONE 10(8): e0134540. https://doi.org/10.1371/journal.pone.0134540

Editor: Eugene Andres Houseman, Oregon State University, UNITED STATES

Received: March 4, 2015; Accepted: July 11, 2015; Published: August 27, 2015

Copyright: © 2015 Straube et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: S. paradoxus evolution (GSE36253) and M. musculus chemoimmunotherapy (GSE27440) data are available via the NCBI Gene Expression Omnibus (GEO). iTraq breast cancer data are available via the EBI PRIDE PRoteomics IDEntifications database (PRD000178). iTraq kidney rejection data are from the Freue et al. study (10.1371/journal.pcbi.1002963), and requests to access this data can be submitted to Dr. Gabriela V. Cohen Freue (gcohen@stat.ubc.ca) and Dr. Raymond T. Ng (rmcmaster@exchange.ubc.ca).

Funding: This work was supported by the Wound Management Innovation established and supported under the Australian Government’s Cooperative Research Centres Program to JS; the Australian Cancer Research Foundation for the Diamantina Individualised Oncology Care Centre at The University of Queensland Diamantina Institute to KALC; and the Australian Research Council Discovery Early Career Researcher Award (project number DE120101127) to EH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Over the past decade, the use of ‘omics’ to take a snapshot of molecular behaviour has become ubiquitous. It has recently become possible to examine a series of such snapshots by measuring an ‘ome’ over time. This provides a powerful tool to study stressor-induced molecular behaviour [1], developmental processes (e.g., ageing; [2]) and cyclic mechanisms (e.g., cell cycle; [3]).

Robust and powerful analysis tools are critical for capitalizing on the wealth of data to answer key questions about system response and function. In addition to addressing the high-dimensionality of the data, such tools must account for a high number of missing values, and also variability within and between studied subjects. Many methods are limited by scale, and are unable to handle either a large number of time points, a varying number of time points per subject [4] or a very large number of molecules [5]. Hence there is an urgent need for filtering and modelling these time course data, not only to decrease the number of profiles analyzed, but also to collapse subject-specific profiles to a summary thereof.

The benefit of decreasing the number of profiles analyzed via filtering is evident when considering the scale of typical time course ‘omics’ experiments. Tens of thousands of molecules can be measured at different time points, requiring multiple hypothesis tests to determine differential expression. While the false positive rate can be controlled using multiple testing corrections (e.g., FDR; [6]), these are frequently accompanied by an increase in the false negative rate. Hence identifying and removing non-informative molecules prior to testing can help to increase statistical power. This drives a need for accurate approaches to remove a large number of non-informative profiles. Indeed, estimates are that only 30–40% of the genes are expressed at array-detectable levels [7], increasing up to 60–70% for newer technologies like RNA sequencing [8]. Furthermore, modelling can provide considerable benefits by summarizing the remaining, informative profiles. Our aim in this study is to model the systematic process from which expression levels derive, as a smooth function over time, so that observed measurements can then be seen as a noisy realization of this function.

A popular modelling approach for time course data is smoothing splines, which use a piecewise polynomial function with a penalty term [9]. The two main drawbacks are the arbitrary selection of the penalty and the computational burden, both of which have received extensive attention. For example, [10] reparametrized smoothing splines in a linear mixed model spline framework to address the arbitrary choice of penalty. However, the smoothing splines models developed in this framework are still computationally challenging to fit with an increasing number of time points [11, 12]. The standard smoothing splines approach faces similar challenges, which can in part be mitigated using spline regression. There the computation-limiting factor is the number of polynomial pieces rather than the number of time points. Since splines can be calculated using linear mixed models, a wide range of methods have been proposed to improve computational efficiency of such models [13], [14]. More recently, [15] presented a tradeoff between spline regression and the linear mixed model spline framework by combining low-rank smoothers adapted from [16] with the penalty approach of [13]. The hybrid approach results in a truncated spline basis which improves computational efficiency and relaxes the importance of initial parameters choices.

After the filtering and modelling steps, the resulting summarized profiles can be clustered to gain biological insight from their similarities. Indeed, clusters of correlated activity patterns may predict putative functions for molecules and reveal stage- and tissue-specific regulators [2]. To that end, several spline-based clustering methods have been proposed in the literature [17, 18]. However, common limitations include additional assumptions on the distribution of the data, computational cost and dependency of the resulting clusters on the initial parameters. To our knowledge, no approach currently incorporates subject-specific random effects in a spline model in order to accurately model subject-specific variation before clustering.

Hypothesis testing can also be performed within the mixed effect model framework to gain biological insight from differences between groups and across time. Several methods have been proposed which can all handle missing data and different numbers of replicates per time point, but are often limited when only a few time points are observed, as is typically the case for costly high-throughput experiments. Approaches such as linear models for microarray data (LIMMA; [19]) test contrasts of interest in a spline framework using an empirical Bayes approach [20], but do not account for subject-specific variation in the model. Extraction and analysis of Differential Gene Expression (EDGE; [21]) does model subject-specific effects as scalar shifts from the mean population response but lacks flexibility and has been reported to not adequately model data in simulated scenarios [22]. A more flexible approach is Smoothing Splines Mixed Effects (SME; [22]), which models subject-specific effects as full curves, but with the risk of over-smoothing profiles in some cases.

In this paper we propose a novel framework for time course ‘omics’ studies, which is summarized in Fig 1. First, we extend a quality assessment and filtering approach to time course data to identify and remove non-informative molecular profiles. Second, we propose a serial modelling approach which avoids both under- and over-smoothing by allowing the data to drive the complexity of the curve in order to fit the appropriate model. These modelled and summarized profiles can then be analyzed for clustering and differential expression analyses. We illustrate the use of our framework in simulation and real time course ‘omics’ case studies.

Fig 1. Overview of the analysis framework.

The proposed framework consists of three stages: quality control and filtering; serial modelling of profiles; and analysis with clustering to identify similarities between profiles or with hypothesis testing to identify differences over time, between groups, and/or in group and time interactions.

https://doi.org/10.1371/journal.pone.0134540.g001

Material and Methods

Material

We first applied the filtering and modelling stages of our framework to two publicly available transcriptomics datasets, which are briefly described below. The main analyses and biological interpretations were then performed on two proteomics datasets from breast cancer and kidney rejection studies.

S. paradoxus evolution data (GSE36253).

The evolutionary principles of modular gene regulation in yeast were investigated by [23]. They tracked growth on glucose in real-time by measuring the growth rate, glucose, and ethanol levels. Expression of 5,503 genes was measured at six physiologically comparable time points. Samples were hybridized to microarrays with the reference chosen to be the same physiological phase in all cases. In this study we selected a single species (S. paradoxus) with two to four biological replicates per time point.

M. musculus chemoimmunotherapy data (GSE27440).

The anti-tumour efficiency of a chemotherapeutic drug on bone marrow in mice was investigated by [24]. Expression of 13,443 genes was measured pre-treatment, 1, 2 and 5 days after chemotherapy of tumour-bearing mice. At each time point five biological and two technical replicates were assayed.

iTraq breast cancer data (PRD000178).

Proteomic changes in MCF-7 cells resulting from insulin-like growth factor 1 (IGF-1) stimulation were investigated by [1]. As impairment of the IGF-1 receptor signalling network is involved in tumour growth and chemotherapy resistance, the study of proteins involved in this network may help to understand the underlying mechanisms and to identify potential drug targets. iTraq Liquid Chromatography followed by a two-dimensional Mass Spectrometry scan (LC-MS/MS) was used to quantify proteins at 0 h (no IGF-1), 6, 12 and 24 h after IGF-1 stimulation. This procedure was repeated in three separate cultures. In total 899 proteins were identified. Sample-wise scaled log₂ fold changes for time points 6, 12 and 24 h relative to baseline (0 h) were reported for 264 proteins with minimum two measured replicates. We applied our full data-driven modelling approach to this dataset, finishing with cluster analysis to explore patterns of protein response to IGF-1 stimulation.

iTraq kidney rejection data.

The PROOF Centre of Excellence performed a longitudinal study to identify diagnostic biomarkers in blood plasma to predict acute renal allograft rejection [25]. The iTraq kidney rejection dataset is a subsample thereof which includes 10 Acute Rejection patients (AR) and 20 Non-Rejection patients (NR). In this discovery study, iTraq MALDI-TOF MS/MS technology was used to quantify plasma protein relative concentrations in blood samples tracked prior to (0 weeks) and post transplant at 0.5, 1, 2, 3, and 4 weeks. In total, 140 proteins were quantified from blood samples. We applied our full data-driven modelling approach to this dataset, finishing with differential expression analysis to identify proteins whose profiles differed between the two groups.

Simulated data.

For each of six different scenarios varying noise levels and fold changes, we simulated 100 datasets, each consisting of 140 profiles, 50 of which were differentially expressed. For each dataset, we applied our differential expression approach and LIMMA [19], and compared their sensitivity and specificity to differential expression over time, between groups and in group*time interactions. A detailed description of the simulation procedure can be found in the Supporting Information files, with examples of simulated profiles (Figure A in S1 File).
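The two performance measures used in this comparison can be sketched as follows. This is a minimal illustration, not code from the lmms package: the function name and the boolean-list inputs (`truth` for simulated differential expression, `called` for the profiles declared significant) are assumptions made for the example.

```python
def sensitivity_specificity(truth, called):
    """Return (sensitivity, specificity) for paired boolean labels.

    truth[i]  is True if profile i was simulated as differentially expressed.
    called[i] is True if the method declared profile i significant.
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    sensitivity = tp / (tp + fn)  # fraction of true signals recovered
    specificity = tn / (tn + fp)  # fraction of null profiles left uncalled
    return sensitivity, specificity
```

Both denominators must be non-zero, which holds whenever a simulated dataset contains at least one differentially expressed and one null profile.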

Methods

Quality control and filtering

Filtering on the overall standard deviation of molecule expression is a common approach in static gene expression experiments to remove non-informative molecules prior to analysis [26]. The justification is that low standard deviations indicate little molecular activity, and so molecules which vary more are of more interest. In time course experiments however, molecules can vary both over time and between subjects. Therefore, an increase in the overall standard deviation does not necessarily indicate interesting molecular behaviour and the additional time dimension of the data needs to be accounted for.

Rather than the overall standard deviation, defined below as s_M, we considered two filter ratios based on the standard deviations across time and subjects. These estimates can be used to identify low quality and/or non-informative profiles. Let T be the number of time points and n the total number of subjects. For each molecule, we denote by y_i(t) the expression for subject i at time t, with i = 1, …, n and t = 1, …, T, and by s_T the average of standard deviations (SD) computed per time point, with

$$s_T = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( y_i(t) - \bar{y}(t) \right)^2}, \qquad \bar{y}(t) = \frac{1}{n} \sum_{i=1}^{n} y_i(t).$$

Similarly, s_I is the average of SDs computed per subject, with

$$s_I = \frac{1}{n} \sum_{i=1}^{n} \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} \left( y_i(t) - \bar{y}_i \right)^2}, \qquad \bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_i(t),$$

and s_M is the SD for each molecule, over all subjects and time points:

$$s_M = \sqrt{\frac{1}{nT-1} \sum_{i=1}^{n} \sum_{t=1}^{T} \left( y_i(t) - \bar{y} \right)^2}, \qquad \bar{y} = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} y_i(t).$$

Missing values were excluded from the relevant sums. We then define the filter ratios R_T and R_I as

$$R_T = \frac{s_T}{s_M}, \qquad R_I = \frac{s_I}{s_M}.$$

Our filter ratios are motivated by the expectation that the SD values for profiles consisting purely of noise are different compared to those with a true signal over time. Fig 2 illustrates some example profiles to motivate the use of one of the ratios, R_T, for quality control. The first type of profile consists purely of noise, resulting in s_T ≈ s_M and therefore R_T ≈ 1. The second type of profile has a true signal over time, resulting in s_M greater than s_T and R_T < 1. Hence, R_T provides one means of discriminating between non-informative and informative profiles. We generally expect subject-specific profiles to be close to the mean molecule profile, resulting in s_I ≈ s_M and therefore R_I ≈ 1, as would also be true for noisy profiles over time. Therefore, on its own, R_I is only a good discriminator of unambiguously flat profiles, for which s_I may often be smaller than s_M, resulting in R_I < 1. Nevertheless, the combination of both R_T and R_I can provide additional insights into the variance structure of the molecules and can guide the user to make more informed choices about filter ratio thresholds, as illustrated in our case studies.
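The filter ratios can be computed for one molecule as in the sketch below. It is an illustration under stated assumptions rather than the lmms implementation: the data layout (`profile[i][t]`, with `None` for a missing value), the function name, and the use of the sample SD (denominator one less than the number of observed values) are choices made for the example.

```python
from statistics import stdev


def filter_ratios(profile):
    """Return (R_T, R_I) for one molecule.

    profile[i][t] is the expression of subject i at time index t,
    or None when the measurement is missing.
    """
    n_subjects = len(profile)
    n_times = len(profile[0])

    def sd(values):
        # Sample SD over observed values; undefined with fewer than two.
        observed = [v for v in values if v is not None]
        return stdev(observed) if len(observed) >= 2 else None

    def mean_defined(values):
        # Average only the SDs that could be computed.
        defined = [v for v in values if v is not None]
        return sum(defined) / len(defined)

    # s_T: average over time points of the SD across subjects.
    s_t = mean_defined(
        [sd([profile[i][t] for i in range(n_subjects)]) for t in range(n_times)]
    )
    # s_I: average over subjects of the SD across time points.
    s_i = mean_defined([sd(row) for row in profile])
    # s_M: SD over all observed values of the molecule.
    s_m = sd([v for row in profile for v in row])
    return s_t / s_m, s_i / s_m
```

For three subjects following the same increasing trend with small offsets, for example `[[0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.1, 3.1], [-0.1, 0.9, 1.9, 2.9]]`, the sketch returns an R_T well below 1 and an R_I close to 1, in line with the behaviour described for profiles with a true signal over time.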

Fig 2. Examples of ‘noisy’ and differentially expressed profiles.

Profiles changing over time (blue) have a mean of the standard deviations per time point (s_T) smaller than the mean of the standard deviations per molecule (s_M), while these means have similar values for noisy molecules (brown). In both cases the mean of the standard deviations per subject (s_I) is similar to s_M.

https://doi.org/10.1371/journal.pone.0134540.g002

During our filtering stage, we first removed molecules with more than 50% missing data and applied model-based clustering (R package mclust [27]) on the filter ratios R_T and R_I by specifying two clusters. Based on the rationale described above, we expect the cluster of profiles with low R_T and R_I to be informative and propose to discard profiles in the cluster with high R_T and R_I. In the specific case where a time course study includes the comparison of multiple conditions or treatments, it is important to avoid filtering profiles which may be non-informative within a condition but are differentially expressed between conditions. Therefore, we propose to apply the filtering approach to each condition separately, with the additional requirement that profiles must be found non-informative in all conditions in order to be removed.
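The two decision rules in this stage, the missing-data cut-off and the requirement that a profile be non-informative in every condition before removal, can be sketched as below. The clustering step itself is not shown; the per-condition sets of flagged molecules are assumed to come from it. All names are hypothetical.

```python
def too_many_missing(values, max_fraction=0.5):
    """True if more than max_fraction of the measurements are missing (None)."""
    missing = sum(1 for v in values if v is None)
    return missing / len(values) > max_fraction


def molecules_to_remove(noninformative_by_condition):
    """Return molecules flagged as non-informative in all conditions.

    noninformative_by_condition maps a condition label to the collection of
    molecule identifiers assigned to the high R_T / high R_I cluster when the
    filter is run on that condition alone.
    """
    flagged = [set(ids) for ids in noninformative_by_condition.values()]
    return set.intersection(*flagged) if flagged else set()
```

With `{"AR": {"p1", "p2"}, "NR": {"p2", "p3"}}` only `p2` is removed, since `p1` and `p3` are each retained by one condition.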

Modelling

In high-throughput experiments, thousands of molecule profiles need to be modelled in an efficient manner. Biological variability both between and within subjects must be accounted for, and experimental procedures typically result in different numbers of replicated measurements per molecule and time point. The combination of all of these factors requires a flexible, robust model-fitting procedure which can easily accommodate different sources of variation.

Model fit with Linear Mixed Model Splines (LMMS): For each molecule, we determine an appropriate model via a serial model fitting approach. This avoids under- or over-fitting by allowing the data structure to drive the model complexity, rather than relying on a priori assumptions such as in [19], [21]. We make comparisons between successive models using a goodness of fit test, retaining a more complex model only if it fits the data better than a simpler model. The goodness of fit is assessed with the log likelihood ratio test as implemented in the anova function of the nlme package. The four models considered in this process are described below, listed in order of increasing complexity.

The first model assumes the response is a straight line and is not affected by subject variation. For each molecule, we denote by y_ij(t_ij) its expression for subject (or biological replicate) i at time t_ij, where i = 1, 2, …, n, j = 1, 2, …, m_i, n is the sample size and m_i is the number of observations for subject i. We fit a simple linear regression of expression y_ij(t_ij) on time t_ij, where the intercept β₀ and slope β₁ are estimated via ordinary least squares:

$$y_{ij}(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2) \tag{1}$$
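For the straight-line model, the ordinary least squares estimates have a closed form. The sketch below pools all observations of a molecule across subjects, as the first model does; the function name is illustrative.

```python
def fit_line(times, values):
    """Ordinary least squares fit of values on times.

    Returns (intercept, slope). All subjects' observations are pooled,
    so no subject-specific effect is modelled.
    """
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return intercept, slope
```

At least two distinct time points are needed, otherwise `sxx` is zero and the slope is undefined.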

As nonlinear response patterns are commonly encountered in time course biological data [28], our second model replaces the straight line in Eq 1 with a curve that is modelled using a spline truncated line basis as proposed by [15]:

$$y_{ij}(t_{ij}) = f(t_{ij}) + \epsilon_{ij} \tag{2}$$

In Eq 2, f represents a penalized spline which depends on a set of knot positions κ₁, …, κ_K in the range of {t_ij}, some unknown coefficients u_k to be estimated, an intercept β₀ and a slope β₁. That is,

$$f(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \sum_{k=1}^{K} u_k \, (t_{ij} - \kappa_k)_+ \tag{3}$$

where (x)_+ = max(0, x). Since a spline is a composition of curve segments and the knots define the break points of the curve segments, the choice of the number of knots K and their positions influences the shape of the curve. As proposed by [29], we estimate the number of knots K from the number of measured time points T, setting knots κ₁, …, κ_K at quantiles of the time interval of interest.
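The truncated line basis and quantile-based knot placement can be sketched as follows. The number of knots is taken as an input here, and the specific quantile rule (equally spaced probabilities over the sorted unique time points, with linear interpolation) is an assumption made for the example; it is not necessarily the rule used in the lmms package.

```python
def knots_at_quantiles(times, n_knots):
    """Place n_knots interior knots at equally spaced quantiles of the
    unique time points (linear interpolation between neighbours)."""
    pts = sorted(set(times))
    knots = []
    for j in range(1, n_knots + 1):
        pos = j / (n_knots + 1) * (len(pts) - 1)  # fractional index into pts
        lo = int(pos)
        hi = min(lo + 1, len(pts) - 1)
        knots.append(pts[lo] + (pos - lo) * (pts[hi] - pts[lo]))
    return knots


def truncated_line(t, knot):
    """Truncated line basis function (t - knot)_+ = max(0, t - knot)."""
    return max(0.0, t - knot)


def spline_mean(t, beta0, beta1, u, knots):
    """Evaluate f(t) = beta0 + beta1 * t + sum_k u_k * (t - kappa_k)_+."""
    return beta0 + beta1 * t + sum(
        u_k * truncated_line(t, kappa) for u_k, kappa in zip(u, knots)
    )
```

Each basis function is zero before its knot and linear after it, so every coefficient u_k changes the slope of the curve from κ_k onwards.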

In order to account for subject variation, our third model (Eq 4) adds a subject-specific random effect U_i to the mean response f(t_ij). Assuming f(t_ij) to be a fixed (yet unknown) population curve, U_i is treated as a random realization from an underlying Gaussian distribution independent from the previously defined random error term ϵ_ij. Hence, the subject-specific curves are expected to be parallel to the mean curve as we assume the subject-specific random effects to be constant over time:

$$y_{ij}(t_{ij}) = f(t_{ij}) + U_i + \epsilon_{ij}, \qquad U_i \sim N(0, \sigma_U^2) \tag{4}$$

A simple extension to this model is to assume that the subject-specific deviations are straight lines. Our fourth model therefore fits subject-specific random intercepts a_i0 and slopes a_i1:

$$y_{ij}(t_{ij}) = f(t_{ij}) + a_{i0} + a_{i1} t_{ij} + \epsilon_{ij}, \qquad (a_{i0}, a_{i1})^\top \sim N(0, \Sigma) \tag{5}$$

We assume independence between the random intercept and slope, and therefore the covariance matrix for the random effects Σ is diagonal.

Derivative information for Linear Mixed Model Splines (DLMMS): The derivative of expression profiles contains valuable information about the rate of change of expression over time [9, 30]. We consider the derivative of the mean population curve f(t) from Eq 3. Note that for profiles modelled using only Eq 1 the derivative is constant and is equal to the estimate of the slope. Otherwise, the derivative at any time point t in the relevant time interval is:

$$\hat{f}'(t) = \hat{\beta}_1 + \sum_{k=1}^{K} \hat{u}_k \, I(t > \kappa_k)$$

where I(·) is the indicator function, and β̂₁ and û_k denote the estimates of the slope and spline coefficients. The derivatives of the LMMS profiles can then be used instead of the modelled profiles to gain new insights in the downstream cluster analysis.
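Because the mean curve in Eq 3 is built from truncated lines, its derivative is piecewise constant: the fitted slope plus the coefficients of all knots already passed. A minimal sketch, with illustrative names and fitted values supplied by the caller:

```python
def spline_derivative(t, beta1_hat, u_hat, knots):
    """Derivative of the truncated line spline at time t.

    Each basis function (t - kappa_k)_+ contributes its coefficient
    u_k to the slope once t has passed the knot kappa_k.
    """
    return beta1_hat + sum(
        u_k for u_k, kappa in zip(u_hat, knots) if t > kappa
    )
```

With a slope estimate of 0.5, one knot at t = 2 and a spline coefficient of 2.0, the derivative is 0.5 before the knot and 2.5 after it. At the knot itself the derivative is not defined; the sketch returns the left-hand value.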

Clustering

Clustering of time profiles allows insight into which molecules share similar patterns of response, which may in turn indicate a shared biological basis. Similarities between trajectories may be seen not only in terms of shape and magnitude, but also rates of change, or speed. However, detecting these similarities can be challenging due to noise and missing values in subject-specific measurements. Hence, the choice of modelling approach often has critical impact on the ability to identify clusters of biologically similar molecules.

We compared our modelling approaches LMMS and DLMMS to two single-step models using the workflow shown in Fig 3. As a basic comparison, we first calculated the mean at each time point for each molecule as it is arguably the most common way of reducing subject dimension. As a more sophisticated alternative, we applied the R package implementation of the recently proposed modelling approach Smoothing Splines Mixed Effects (SME) [22], which uses a single model that treats each subject-specific trajectory as a smooth function of time.

Fig 3. Workflow for the profile cluster analysis.

Trajectories derived from Linear Mixed Model Spline (LMMS) and Derivative Linear Mixed Model Spline (DLMMS) were compared to trajectories derived either from the mean or Smoothing Splines Mixed Effects (SME) models. Five clustering algorithms—hierarchical clustering (HC), kmeans (KM), Self-Organizing Maps (SOM), model-based (model) and Partitioning Around Medoids (PAM) were then applied on modelled trajectories using a range of two to nine clusters. The performance of each algorithm was assessed using the Dunn index. Gene Ontology (GO) term enrichment analysis was performed on each of the obtained clusters.

https://doi.org/10.1371/journal.pone.0134540.g003

For clustering, we compared the performance of five algorithms using the Dunn index [31] from the clValid R package [32]. The Dunn index is the ratio of the smallest inter-cluster distance to the largest intra-cluster distance. A large index value indicates a good separation of the clusters, and is our criterion of choice to determine both the appropriate number of clusters and the best performing clustering algorithm.
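The Dunn index as defined above can be sketched in a few lines. The example assumes Euclidean distance between profiles, the smallest pairwise distance between members of different clusters as the inter-cluster distance, and the largest pairwise distance within a cluster as the intra-cluster distance. These choices are assumptions for the illustration; the clValid implementation should be consulted for its exact definitions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dunn_index(clusters):
    """Dunn index for a partition.

    clusters is a list of clusters, each a list of profiles
    (sequences of floats). Larger values indicate compact,
    well separated clusters.
    """
    # Smallest distance between two profiles in different clusters.
    min_between = min(
        euclidean(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    # Largest distance between two profiles in the same cluster.
    max_within = max(
        euclidean(a, b) for c in clusters for a, b in combinations(c, 2)
    )
    return min_between / max_within
```

The sketch needs at least two clusters and at least one cluster with two or more distinct profiles; otherwise the minimum or maximum is taken over an empty set, or the denominator is zero.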

We selected clustering algorithms for comparison based on representatives of different classes of standard techniques: a model-based algorithm (mclust; [27]), hierarchical clustering, k-means, partitioning around medoids (cluster; [33]), and Self-Organizing Maps (kohonen; [34]). The last four algorithms utilize a dissimilarity metric to cluster profiles derived from SME, mean and LMMS and the Euclidean distance metric for DLMMS.

A size-based Gene Ontology (GO) term enrichment analysis was then performed to validate the biological relevance of each cluster, using the hypergeometric distribution based on the number of molecules in the domain of interest [35]. We specifically examined the molecules’ spatial link (Cellular Compartment), basal activity (Molecular Function) and involvement in a series of molecular events (Biological Process). All annotations were obtained from the org.Hs.eg.db R package [36].
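A hypergeometric enrichment test of this kind asks how likely it is to see at least the observed number of annotated molecules in a cluster if the cluster were drawn at random from the background. The sketch below computes that upper-tail probability; the function and argument names are illustrative, and the choice of background set is left to the caller.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_cluster, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: molecules in the background set
    n_annotated:  background molecules carrying the GO term
    n_cluster:    molecules in the cluster
    n_overlap:    cluster molecules carrying the GO term
    """
    upper = min(n_annotated, n_cluster)
    total = comb(n_background, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For a background of 10 molecules with 3 annotated, a cluster of 3 that contains all 3 annotated molecules has probability 1/120 under random sampling.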

Differential expression analysis

While cluster analysis can provide valuable insight into behaviour patterns common to groups (clusters) of molecules, differential expression analysis in a time course experiment can highlight significant responses to perturbations of each molecule. Our LMMS framework enables assessment of the significant differences over time or between individual groups based on the whole molecular trajectory instead of analysing individual time points.

LMMS for differential expression analysis (LMMSDE): We extended the LMMS modelling framework to test between groups, across time, and for interactions between groups and time as follows. Suppose we have R different groups of subjects, with h_i denoting the group for each subject i. Further, we define h_ir to be the indicator for the r-th group, that is, h_ir = 1 if h_i = r and 0 otherwise. Starting from the model in Eq 3, which is fit for a single group, we can extend our formulation to allow for variations to the mean curve depending on which group contains each subject. Thus the mean curve for each group f_{h_i} in the full LMMSDE model is given by:

$$f_{h_i}(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \sum_{k=1}^{K} u_k \, (t_{ij} - \kappa_k)_+ + \sum_{r=2}^{R} h_{ir} \left( \alpha_{0r} + \alpha_{1r} t_{ij} + \sum_{k=1}^{K} v_{rk} \, (t_{ij} - \kappa_k)_+ \right) \tag{6}$$

For each r = 2, …, R, α₀ = α_0r is the difference in intercept between group r and the first group; α₁ = α_1r is the difference in slope between group r and the first group; and v_rk are the differences in spline coefficients between group r and the first group.

We can test different hypotheses depending on which parameters are equal to zero. Firstly, for a single group, ∀r > 1, we have h_ir = 0, and time effects will be detected only if the goodness of fit of this model is better than the null model which fits only the intercept. Secondly, to detect differences between groups, we set α₁ = 0 and β₁ = 0, and test a goodness of fit against the null model which also has h_ir = 0. Finally, if we include all parameters we can model the group * time interactions, by allowing different slopes and intercepts in the different groups. We compare this to the null model where the effects over time do not differ between groups. For each case we compared the fit of the expanded model from Eq (6) with the corresponding null model using the likelihood ratio test as implemented in the anova function from the R package nlme [37].
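The logic of a likelihood ratio test between nested models can be illustrated with a deliberately simplified case: a straight-line model against an intercept-only null, both fitted by maximum likelihood with Gaussian errors and no random effects. This is not the mixed model comparison performed by the anova function in nlme; it is a sketch of the principle under those simplifying assumptions, with illustrative names.

```python
import math


def lrt_time_effect(times, values):
    """Likelihood ratio test of a linear time effect.

    Null model: y = b0 + e.  Alternative: y = b0 + b1 * t + e.
    Returns (statistic, p_value), using a chi-square reference
    distribution with 1 degree of freedom.
    """
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar

    rss_null = sum((y - y_bar) ** 2 for y in values)
    rss_line = sum(
        (y - intercept - slope * t) ** 2 for t, y in zip(times, values)
    )
    # With sigma^2 estimated by RSS / n, twice the log likelihood
    # difference reduces to n * log(RSS_null / RSS_line).
    statistic = n * math.log(rss_null / rss_line)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

The more complex model is retained only when the p-value falls below the chosen significance level. The sketch assumes the line does not fit the data perfectly, since a zero residual sum of squares would make the statistic undefined.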

Comparisons with LIMMA: We compared our approach to LIMMA [19], which is a set of methods for microarray data analysis integrating empirical Bayes approaches with linear models. We tested for differences over time and between groups using the following two-step process. First, linear spline models were fitted over time for every group. Second, contrasts of coefficients from the fits were tested for significance. Correction for multiple testing was applied for both methods for a significance level of 0.05 using the FDR approach from [6]. Note that no filtering was performed before differential expression analysis for two reasons: first, we wanted to compare the results based only on differences between models, and not on differences in filtering approaches; second, p-values derived from LIMMA are based on posterior estimates and the removal of non-informative profiles before analysis would therefore bias the results.
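The multiple testing correction can be sketched as below, assuming the FDR approach in [6] is the Benjamini-Hochberg step-up procedure. The function returns adjusted p-values in the input order, so that profiles with an adjusted value at or below 0.05 are declared significant. The function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For raw p-values `[0.01, 0.04, 0.03, 0.5]`, only the first remains at or below 0.05 after adjustment.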