Analysis of Gene Expression Profiles in the Human Brain Stem, Cerebellum and Cerebral Cortex

Lei Chen; Chen Chu; Yu-Hang Zhang; Changming Zhu; Xiangyin Kong; Tao Huang; Yu-Dong Cai

doi:10.1371/journal.pone.0159395

Abstract

The human brain is one of the most mysterious tissues in the body. Our knowledge of the human brain is limited due to the complexity of its structure and the microscopic nature of connections between brain regions and other tissues in the body. In this study, we analyzed the gene expression profiles of three brain regions (the brain stem, cerebellum and cerebral cortex) to identify genes that are differentially expressed among these brain regions in humans and to obtain a list of robust, region-specific, differentially expressed genes by comparing the expression signatures from different individuals. Feature selection methods, specifically minimum redundancy maximum relevance and incremental feature selection, were employed to analyze the gene expression profiles. Sequential minimal optimization, a machine-learning algorithm, was employed to examine the utility of the selected genes. We also performed a literature search, and we discuss the experimental evidence for the important physiological functions of several highly ranked genes, including NR2E1, DAO, and LRRC7, and present our analysis of a gene (TFAP2B) that has not been investigated or experimentally validated. As a whole, the results of our study will improve our ability to predict and understand genes related to brain regionalization and function.

Citation: Chen L, Chu C, Zhang Y-H, Zhu C, Kong X, Huang T, et al. (2016) Analysis of Gene Expression Profiles in the Human Brain Stem, Cerebellum and Cerebral Cortex. PLoS ONE 11(7): e0159395. https://doi.org/10.1371/journal.pone.0159395

Editor: Ramani Ramchandran, Medical College of Wisconsin, UNITED STATES

Received: March 19, 2016; Accepted: July 1, 2016; Published: July 19, 2016

Copyright: © 2016 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This study was supported by the National Natural Science Foundation of China (61202021, 31371335), Shanghai Natural Science Foundation (16ZR1414500), Shanghai Sailing Program and The Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) (2016245). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Human brains are distinguished from those of all other species by their incomparable cognitive capacities. Throughout human history, people have questioned how this mysterious and powerful organ functions in such a highly orchestrated manner. Studies of the human brain using cellular and molecular biological techniques have been undertaken for generations, but the mechanisms underlying the development, differentiation and function of the human brain remain quite elusive. It is widely accepted that fine-tuned spatiotemporal gene expression contributes to the proper function of individual tissues [1]. With the advancement of high-throughput technologies, brain transcriptomics studies have gained more attention and have given rise to a large amount of brain gene expression data over the past decade. Comparative studies of gene expression in the brains of different species have shown that the divergence in brain gene expression patterns has contributed to the changes in human brain function during evolution [2], and some human-specific brain gene expression patterns have been gradually revealed [3–5]. Spatially regulated gene expression is another feature of the human brain that is closely related to its development, differentiation and region-specific functions. Previous studies compared brain regional transcriptomes among different species and revealed a correlation between brain region-specific gene expression and function from an evolutionary perspective [6–8]. However, much less is known about intra-specific brain gene expression in humans and how the divergence in gene expression patterns of human brains contributes to differences based on age, race, sex, and other factors. Financial limitations and the limited availability of human samples usually impede the amount of data acquired or analyzed in a given study, and therefore, it is of great significance to perform data mining of previous datasets [9].

In 2013, the Human Brain Project (HBP), a large, 10-year research project, was established to unravel the mystery of human brains. One of its goals is to organize neuroscience data using a variety of computational models and analytical tools and to better serve the integration and analysis of increasingly large-volume, high-dimensional, and fine-grained experimental data. Despite the progress the HBP has made in the past two years, many obstacles in the methodology (data integration, computation, engineering, etc.) and overall research paradigm must still be addressed [10]. In addition to the HBP, several other previous or ongoing research projects also provide information regarding human brains. For example, the Allen Human Brain Atlas serves as an integrative and powerful database of brain gene expression data with high anatomical resolution [11] and has provided multimodal data-mining resources for many previous studies [9, 12, 13].

In this study, we further explore the common features of regional gene expression in human brains, as well as the differential gene expression patterns among individuals of different ages, races and sexes. Compared to previous studies, we sought to better explore region-specific brain gene expression by minimizing the differences among samples. Considering data availability and reliability, we obtained gene expression data from the Allen Human Brain Atlas database. We employed two feature-selection methods, minimum redundancy maximum relevance (mRMR) and incremental feature selection (IFS) [14], and a machine learning method, the sequential minimal optimization (SMO) algorithm [15, 16], to analyze the gene expression profiles of the brain stem (BS), cerebellum (CB) and cerebral cortex (CC) in the brains of six different people. We also performed a literature search to gather evidence to support our analysis.

Materials and Methods

Materials

We downloaded the gene expression profiles of BS, CB and CC samples from six people (H0351.2001, H0351.2002, H0351.1009, H0351.1012, H0351.1015 and H0351.1016) from the Allen Brain Atlas [11] (http://human.brain-map.org/static/download). Detailed information regarding these six individuals is provided in S1 Text, which can be downloaded from http://help.brain-map.org/download/attachments/2818165/CaseQual_and_DonorProfiles.pdf?version=1&modificationDate=1382051848013. The number of samples from each region available for each person can be found in Table 1. The samples from one person comprise a dataset and, depending on the brain region from which each sample was obtained, can be divided into three classes: BS, CB and CC. For each sample, the expression levels of 20,782 genes were measured using microarray analysis. The expression level of each gene served as a feature for classifying the samples from different brain regions.

Table 1. The distribution of samples, obtained from six individuals, among three regions of the human brain.

https://doi.org/10.1371/journal.pone.0159395.t001

The goal of this study was to identify the differentially expressed genes among different brain regions in each person and then obtain a list of robust, region-specific, differentially expressed genes by comparing the expression signatures from different persons. Each of the six datasets was used as the training dataset for one analysis in which the remaining five datasets were used as test datasets.

For each individual, there were samples from three brain regions: BS, CB and CC. We sought to identify the genes that can distinguish among them.

For the cross-individual analysis, we used the discriminative genes identified from one individual to classify the samples from the other five individuals. Because individual differences could be a confounding factor in a brain region-specific analysis, the cross-individual analysis serves to measure the robustness of the discriminative genes and to obtain a list of discriminative genes that is specific to brain region rather than to the individual.

mRMR method

To identify differentially expressed genes, we used a popular feature selection method, mRMR, to analyze the gene expression profiles of three regions in the human brain. The mRMR method, proposed by Peng et al. [14], was developed based on two criteria: Max-Relevance and Min-Redundancy. The output of the mRMR program contains two feature lists, the MaxRel feature list and the mRMR feature list, in which all features are sorted. Max-Relevance guarantees that the features that correlate strongly with the target variable receive high ranks, while Min-Redundancy guarantees that a feature with low redundancy to features already in the list is selected in the next round. The MaxRel feature list was produced based only on the Max-Relevance criterion, whereas the mRMR feature list was produced based on both criteria. We denote these two feature lists as follows:

$$F_{MaxRel} = [f_1^{M}, f_2^{M}, \ldots, f_N^{M}], \qquad F_{mRMR} = [f_1^{m}, f_2^{m}, \ldots, f_N^{m}] \tag{1}$$

where N is the number of features and the subscript of each feature gives its rank in the corresponding list.

The MaxRel feature list indicates which features are individually important for distinguishing samples from different classes. However, the top features in this list do not always form an optimal combination for classification because they may be redundant with one another. The mRMR feature list, by contrast, takes such redundancy into account. Thus, the mRMR feature list is more appropriate for building an optimal classification model and extracting an optimal combination of features. MaxRel and mRMR feature lists have been widely used to address a variety of biological problems [17–26].
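The difference between the two lists can be illustrated with a minimal Python sketch. It is not the mRMR program used in this study: it assumes already discretized expression values, uses mutual information as the relevance and redundancy measure, and uses the difference form of the criterion (relevance minus mean redundancy). The function names, the toy genes `g1`–`g3` and their values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def rank_features(features, labels, n_select):
    """Return (MaxRel list, mRMR list) of feature names.

    features: dict mapping feature name -> list of discretized values, one per sample
    labels:   list of class labels, one per sample
    """
    relevance = {
        name: mutual_information(values, labels) for name, values in features.items()
    }

    # MaxRel list: rank by relevance to the class label only.
    maxrel = sorted(relevance, key=relevance.get, reverse=True)[:n_select]

    # mRMR list: greedily add the feature that maximizes relevance minus
    # its mean redundancy with the features already selected.
    selected, remaining = [], set(features)
    while remaining and len(selected) < n_select:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return maxrel, selected


# Toy data (hypothetical): g2 duplicates g1, g3 is weaker but complementary.
labels = ["BS", "BS", "CB", "CB", "CC", "CC"]
features = {
    "g1": [0, 0, 1, 1, 1, 1],
    "g2": [0, 0, 1, 1, 1, 1],
    "g3": [0, 0, 0, 1, 1, 1],
}
maxrel_list, mrmr_list = rank_features(features, labels, 3)
```

In this toy case the duplicated gene `g2` keeps its high rank in the MaxRel list but is pushed behind `g3` in the mRMR list, because it adds nothing beyond `g1`.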

Prediction engine

The mRMR method described above only provides ranked lists of features. To extract the important features (genes), a prediction engine is also needed; it was used together with the mRMR feature list as described below. In this study, we adopted a powerful machine learning method, SMO, as the prediction engine. SMO was proposed by John Platt [15, 16] and is one of the most popular methods for solving the support vector machine (SVM) training problem in the dual space. It is a decomposition method that always uses the smallest possible working set, which contains two dual variables and can be updated very efficiently. The optimization problem is thus divided into a series of smallest possible sub-problems, each of which is solved analytically [27]. SMO is now used in many algorithms, especially for C-SVM (SVM for classification) [28–30], and several published papers have validated its convergence behavior [31–33]. Together, these results indicate that SMO is a good method for solving the SVM optimization problem. Because this study involved three classes, pair-wise coupling [34] was applied to build the multi-class classifier.
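For reference, the analytic solution of one two-variable sub-problem can be written as follows. The notation is the one commonly used to present Platt's algorithm and is not taken from this study: $\alpha_1, \alpha_2$ are the two dual variables in the working set, $y_i \in \{-1, +1\}$ are the class labels, $K_{ij} = K(x_i, x_j)$ are kernel values, $f$ is the current decision function, and $C$ is the box constraint. The update below assumes $\eta > 0$.

```latex
\begin{aligned}
E_i &= f(x_i) - y_i, \qquad \eta = K_{11} + K_{22} - 2K_{12},\\
\alpha_2^{\text{new}} &= \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta},\\
\alpha_2^{\text{clipped}} &= \min\bigl(H,\ \max(L,\ \alpha_2^{\text{new}})\bigr),\\
\alpha_1^{\text{new}} &= \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\text{clipped}}\bigr),\\[4pt]
(L, H) &=
\begin{cases}
\bigl(\max(0,\ \alpha_2 - \alpha_1),\ \min(C,\ C + \alpha_2 - \alpha_1)\bigr) & \text{if } y_1 \neq y_2,\\
\bigl(\max(0,\ \alpha_1 + \alpha_2 - C),\ \min(C,\ \alpha_1 + \alpha_2)\bigr) & \text{if } y_1 = y_2.
\end{cases}
\end{aligned}
```

The clipping to $[L, H]$ keeps both variables inside the box $[0, C]$ while preserving the equality constraint $\sum_i \alpha_i y_i = 0$, which is why no numerical inner solver is needed.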

In Weka [35], a classifier called SMO implements the SMO method described above. For convenience, this classifier was directly employed as the prediction engine and was used with its default parameters.

Revised IFS method

The original IFS method, which was proposed by Peng et al. [14], uses the mRMR feature list and a basic prediction engine to extract important features and develop an optimal prediction model, with all procedures executed only on the training dataset. In this study, we focused on finding differentially expressed genes that are identified in one individual and can classify the samples from the other five individuals, rather than individual-specific differentially expressed genes. We therefore modified the original IFS method by executing it on both the training dataset and the five test datasets. A detailed description is presented below:

1. According to the mRMR feature list $F_{mRMR} = [f_1^{m}, f_2^{m}, \ldots, f_N^{m}]$, N feature sets, denoted as $F_1, F_2, \ldots, F_N$, can be constructed as $F_i = [f_1^{m}, f_2^{m}, \ldots, f_i^{m}]$ $(1 \le i \le N)$, i.e., $F_i$ contains the first i features in the mRMR feature list.
2. For each $F_i$, samples in the training dataset and the five test datasets were all represented by the features in $F_i$. Then, the classifier SMO was trained on the training dataset, and its performance was evaluated using the five test datasets. The predicted results for each of the test datasets were summarized as the total prediction accuracy and the accuracy of each class.
3. For each test dataset, an IFS curve was plotted with the number of features used as the X-axis and the total prediction accuracy as the Y-axis.
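The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in this study: a simple nearest-centroid classifier stands in for the Weka SMO classifier so that the example stays self-contained, and the function names, the donor label `donor_B`, the genes `gA` and `gB` and all values are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT SMO): compute one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def nearest_centroid_predict(model, X):
    """Assign each row to the class with the closest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(sorted(model), key=lambda c: dist2(model[c], row)) for row in X]


def revised_ifs(ranked, train, tests):
    """Compute one IFS curve per test dataset.

    ranked: feature names in mRMR order
    train:  (samples, labels); each sample is a dict feature name -> value
    tests:  dict test-dataset name -> (samples, labels)
    Returns dict name -> [total accuracy with the first i features, i = 1..N].
    """
    train_samples, train_labels = train
    curves = {name: [] for name in tests}
    for i in range(1, len(ranked) + 1):
        subset = ranked[:i]  # step 1: F_i holds the first i ranked features

        def project(samples):
            return [[s[f] for f in subset] for s in samples]

        # Step 2: train on the training dataset only, evaluate on every test dataset.
        model = nearest_centroid_fit(project(train_samples), train_labels)
        for name, (samples, labels) in tests.items():
            predicted = nearest_centroid_predict(model, project(samples))
            correct = sum(p == t for p, t in zip(predicted, labels))
            # Step 3: one point of the IFS curve for this test dataset.
            curves[name].append(correct / len(labels))
    return curves


# Toy data (hypothetical): gA alone cannot separate CB from CC; gA + gB can.
train = (
    [{"gA": 0.0, "gB": 0.0}, {"gA": 1.0, "gB": 0.0}, {"gA": 1.0, "gB": 1.0}],
    ["BS", "CB", "CC"],
)
tests = {
    "donor_B": (
        [{"gA": 0.1, "gB": 0.1}, {"gA": 0.9, "gB": 0.1}, {"gA": 0.9, "gB": 0.9}],
        ["BS", "CB", "CC"],
    )
}
curves = revised_ifs(["gA", "gB"], train, tests)
```

The key design point mirrored here is that feature ranking and training use one individual only, while each curve is computed on a different individual's samples.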

Theoretically, for a training dataset and a test dataset, we want to find the optimal combination of genes that can accurately evaluate the differences between two people. Thus, a feature set that results in the maximum total prediction accuracy on the test dataset represents the optimal combination of genes for which we are searching. However, this feature set always contains an extremely large number of features (genes), which complicates the analysis. In this study, an inflection point was identified on each IFS curve. The inflection point is defined as the first point on the curve for which the total prediction accuracy at this point is greater than or equal to the total prediction accuracy at the point prior to this point and for which the total prediction accuracy is greater than the total prediction accuracy at the point posterior to this point. Features in the feature set corresponding to this point were deemed to be significant for the problem addressed in this study.
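The inflection-point rule stated above translates directly into a short search over an IFS curve. The following Python sketch assumes the curve is stored as a list whose k-th entry (0-based) is the total prediction accuracy obtained with the first k+1 features; the function name and the example values are hypothetical.

```python
def first_inflection_point(accuracies):
    """Return the number of features at the first inflection point, or None.

    A point qualifies when its accuracy is >= that of the previous point
    and strictly > that of the next point.
    """
    for k in range(1, len(accuracies) - 1):
        if accuracies[k] >= accuracies[k - 1] and accuracies[k] > accuracies[k + 1]:
            return k + 1  # convert the 0-based index to a feature count
    return None


# Hypothetical curve: accuracy plateaus at 2-3 features, then drops at 4.
curve = [0.50, 0.60, 0.60, 0.55, 0.90, 0.80]
n_features = first_inflection_point(curve)
```

Note that the rule picks the first local peak (here, three features) even when a later feature set reaches a higher accuracy, which is what keeps the selected gene lists short.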

Results and Discussion

Results of the mRMR and revised IFS methods

The gene expression profiles of three brain regions from six individuals were examined in this study, thereby comprising six datasets. The mRMR method was executed on each of these six datasets and yielded the MaxRel feature list and mRMR feature list. Due to the limitations of our computational power, we only required output of the first 500 features in each of the two lists. The obtained lists are provided in S1 Table.

Each of the six datasets was used as a training dataset, with the other five datasets used as test datasets. Thus, the revised IFS method described in Section “Revised IFS method” was executed six times. Each time, the IFS method constructed 500 feature sets according to the mRMR feature list. The features in each set were used to represent samples in the training dataset and five test datasets. The prediction engine SMO was trained on the training dataset, and its performance was evaluated on the five test datasets. For each test dataset, a series of total prediction accuracies and accuracies of three classes were obtained, all of which are provided in S2 Table. According to the third step of the revised IFS method, the predicted results of each test dataset can produce an IFS curve. Thus, for one training dataset and five test datasets, we can obtain five IFS curves. Six groups of IFS curves are illustrated in S1–S6 Figs. These curves show that for a given training dataset and test dataset, the feature set yielding the maximum value for total prediction accuracy on the test dataset can be obtained, and the numbers of features in these sets are listed in S3 Table. A 3-D histogram was plotted in Fig 1 to show the number of features in the feature set yielding the maximum total prediction accuracy and the corresponding total prediction accuracy.
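The two kinds of accuracy reported for each test dataset can be computed as in the following Python sketch. It assumes that the accuracy of a class means the fraction of samples from that brain region that were predicted correctly; this definition, the function name and the example labels are assumptions made for illustration.

```python
from collections import Counter


def total_and_class_accuracy(true_labels, predicted_labels):
    """Return (total accuracy, dict class -> accuracy within that class)."""
    per_class_total = Counter(true_labels)
    per_class_correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    class_accuracy = {
        c: per_class_correct[c] / n for c, n in per_class_total.items()
    }
    total_accuracy = sum(per_class_correct.values()) / len(true_labels)
    return total_accuracy, class_accuracy


# Hypothetical predictions for six samples of one test dataset.
true_regions = ["BS", "BS", "CB", "CB", "CC", "CC"]
predicted_regions = ["BS", "CB", "CB", "CB", "CC", "BS"]
total_acc, class_acc = total_and_class_accuracy(true_regions, predicted_regions)
```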

Fig 1. A 3-D histogram illustrating the number of features in the feature set yielding the maximum total prediction accuracy and the corresponding total prediction accuracy.

The height of the bar represents the maximum total prediction accuracy, whereas the color of the bar represents the number of features in the feature set yielding the maximum total prediction accuracy.

https://doi.org/10.1371/journal.pone.0159395.g001

From S3 Table and Fig 1, we can see that the feature set yielding the maximum total prediction accuracy always contains a large number of features (genes), which makes it difficult to further analyze their importance. Thus, as mentioned in Section “Revised IFS method”, we selected an inflection point in each IFS curve. To do so, we enlarged the portion of each IFS curve between 4 and 50 on the X-axis, as illustrated in Figs 2–7. The obtained inflection points and their corresponding total prediction accuracies are presented in S4 Table. Additionally, a 3-D histogram was plotted in Fig 8 to show the number of features at each inflection point and the corresponding total prediction accuracy. It can be observed from S2 Table and Fig 8 that the number of selected features (genes) at the inflection point was, in most cases, greatly reduced.

Fig 2. Parts of five IFS-curves using the data of H0351.1009 as the training dataset and the data from other people as the test dataset.