Integrated analysis of human DNA methylation, gene expression, and genomic variation in iMETHYL database using kernel tensor decomposition-based unsupervised feature extraction

  • Y-h. Taguchi,

    Roles Conceptualization, Formal analysis, Investigation, Writing – original draft, Writing – review & editing

    tag@granular.com

    Affiliation Department of Physics, Chuo University, Tokyo, Japan

  • Shohei Komaki,

    Roles Conceptualization, Investigation, Writing – review & editing

    Affiliation Division of Biomedical Information Analysis, Iwate Tohoku Medical Megabank Organization, Disaster Reconstruction Center, Iwate Medical University, Iwate, Japan

  • Yoichi Sutoh,

    Roles Conceptualization, Investigation, Writing – review & editing

    Affiliation Division of Biomedical Information Analysis, Iwate Tohoku Medical Megabank Organization, Disaster Reconstruction Center, Iwate Medical University, Iwate, Japan

  • Hideki Ohmomo,

    Roles Conceptualization, Investigation, Writing – review & editing

    Affiliation Division of Biomedical Information Analysis, Iwate Tohoku Medical Megabank Organization, Disaster Reconstruction Center, Iwate Medical University, Iwate, Japan

  • Yayoi Otsuka-Yamasaki,

    Roles Conceptualization, Investigation, Writing – review & editing

    Affiliation Division of Biomedical Information Analysis, Iwate Tohoku Medical Megabank Organization, Disaster Reconstruction Center, Iwate Medical University, Iwate, Japan

  • Atsushi Shimizu

    Roles Conceptualization, Investigation, Writing – review & editing

    Affiliation Division of Biomedical Information Analysis, Iwate Tohoku Medical Megabank Organization, Disaster Reconstruction Center, Iwate Medical University, Iwate, Japan

Abstract

Integrating gene expression, DNA methylation, and genomic variants simultaneously without location coincidence (i.e., irrespective of distance from each other) or pairwise coincidence (i.e., direct identification of triplets of gene expression, DNA methylation, and genomic variants, and not integration of pairwise coincidences) is difficult. In this study, we integrated gene expression, DNA methylation, and genomic variants from the iMETHYL database using the recently proposed kernel tensor decomposition-based unsupervised feature extraction method with limited computational resources (i.e., short CPU time and small memory requirements). Our method does not require prior knowledge of the subjects because it is fully unsupervised, relying on unsupervised tensor decomposition. The selected genes and genomic variants were significantly targeted by transcription factors that were biologically enriched in KEGG pathway terms as well as in the intra-related regulatory network. The proposed method is promising for integrated analyses of gene expression, methylation, and genomic variants with limited computational resources.

Introduction

The integrated analysis of multiomics datasets has always been difficult. In particular, the integration of gene expression, DNA methylation, and genetic variants has rarely been successful [1, 2]. In contrast, many studies have integrated two of these three: DNA methylation and genomic variants [3–5], gene expression and DNA methylation [6–9], and gene expression and genomic variants [10]. Although Seal et al. [2] successfully predicted gene expression from copy number variants (CNV) and DNA methylation, they did not discuss the relationship between CNV and DNA methylation; therefore, theirs was not a truly integrated analysis. Bell et al. [1] examined DNA methylation as a function of genetic and gene expression variation but did not directly investigate the relationship between gene expression and genetic variants; therefore, their study was not a truly integrated analysis either.

In this study, we applied a recently proposed method [11] for the integrated analysis of gene expression, genetic variants, and DNA methylation using data retrieved from the iMETHYL database [12, 13], without assuming any causal relationship between them, within the framework of a purely data-driven strategy. Gene expression, methylation, and genetic variation shared patient-dependent patterns and were regulated by transcription factors. Enrichment analysis indicated that these transcription factors are related to various biological functions.

Materials and methods

Data set

The data set comprised gene expression, DNA methylation, and genomic variation profiles obtained from the same patients for each cell type (CD4-positive T cells: 99 patients; monocytes: 99 patients; neutrophils: 94 patients; see the Venn diagram in Fig 1). In total, 194 unique subjects for whom all three measurements (i.e., gene expression, DNA methylation, and genomic variation) were available in at least one of the three cell types were included in the analysis. The dataset analyzed in this study was obtained from the iMETHYL database after receiving approval from the Medical Ethics Committee of Iwate Medical University (approval no. HGH29-32) and the Ethics Committee of Chuo University (2019-6 and 2021-072).

Fig 1. Venn diagram of subjects in CD4T cells, monocytes, and neutrophils.

https://doi.org/10.1371/journal.pone.0289029.g001

Preprocessing

Fastq files obtained from RNA-seq were processed following the GTEx pipeline V8 [14] with slight modifications. Briefly, sequence reads were aligned to the GRCh37 human reference genome using STAR v2.5.0 [15], and bam files were generated.

Sequence reads obtained from whole-genome bisulfite sequencing were aligned using NovoAlign v3.02.08 (Novocraft Technologies, Sdn. Bhd., Selangor, Malaysia). The number of converted and unconverted cytosines mapped to each CpG was counted using NovoMethyl v3.02.08 (Novocraft Technologies), and the proportion of unconverted cytosines was calculated as the DNA methylation level (%) [16].

Whole-genome sequence data were obtained from the Tohoku Medical Megabank Project [17], in which sequence reads were mapped onto the GRCh37 human reference genome using BWA-MEM [18] and variant calls were carried out using GATK v3.7 [19]. The resultant VCF files were further converted into the 012 format, where numeric variables ranging from 0 to 2 represent the number of non-reference alleles.
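As an illustration of the 012 encoding described above, the following minimal sketch counts non-reference alleles in a diploid VCF genotype string. The function name `genotype_to_012` and the handling of missing calls (returned as `None`) are assumptions made for this example; the source does not state which tool performed the conversion or how missing genotypes were treated.

```python
def genotype_to_012(gt):
    """Count non-reference alleles in a diploid VCF GT string such as '0/1' or '1|1'.

    Returns None when the call is missing or not diploid (an assumption of this sketch).
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a == "." for a in alleles):
        return None
    # Allele '0' is the reference; any other index is a non-reference allele.
    return sum(1 for a in alleles if a != "0")


calls = ["0/0", "0|1", "1/1", "./."]
encoded = [genotype_to_012(g) for g in calls]
print(encoded)  # [0, 1, 2, None]
```

Phased (`|`) and unphased (`/`) genotypes are treated identically here because only the allele count matters for this encoding.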

For the gene expression profiles, the bam files were converted into bed files using the bamtobed command. For the gene expression and DNA methylation profiles, the bed files were aggregated (summed or averaged) over consecutive 25,000-nucleotide intervals, separately for each omics type and for each of the 22 autosomes. Hereafter, these intervals are denoted as “genomic regions.” Genetic variants were converted to numeric values (0–2) representing the number of non-reference alleles.
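The aggregation into genomic regions can be sketched as follows. This is a minimal illustration under stated assumptions: records are given as `(chromosome, position, value)` tuples with 0-based positions, and the names `bin_values` and `BIN_SIZE` are hypothetical. It is not the pipeline used in the study.

```python
from collections import defaultdict

BIN_SIZE = 25_000  # interval length stated in the text


def bin_values(records, bin_size=BIN_SIZE, how="mean"):
    """Aggregate (chrom, position, value) records into fixed-width genomic regions.

    Returns a dict mapping (chrom, bin_index) to the summed or averaged value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, value in records:
        key = (chrom, pos // bin_size)  # integer division assigns the region
        sums[key] += value
        counts[key] += 1
    if how == "sum":
        return dict(sums)
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical methylation levels (%) at three CpG positions on one autosome.
records = [("chr1", 100, 80.0), ("chr1", 24_999, 60.0), ("chr1", 25_000, 10.0)]
regions = bin_values(records)
print(regions)  # {('chr1', 0): 70.0, ('chr1', 1): 10.0}
```

The `how` argument covers both options mentioned in the text (summing or averaging); which one suits a given omics layer is left to the caller.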

Tensor decomposition-based unsupervised feature extraction

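We applied TD-based unsupervised FE optimized for multiomics data integration [11] to the dataset. Suppose that $x_{i_k j k} \in \mathbb{R}^{N_k \times M \times K}$ represents the value of the $i_k$th component of the $k$th omics dataset for the $j$th subject, where $N_k$ is the number of components in the $k$th omics dataset, $M$ is the number of subjects, and $K$ is the number of omics datasets. From these, we generated

$$x_{jj'k} = \sum_{i_k=1}^{N_k} x_{i_k j k} x_{i_k j' k} \in \mathbb{R}^{M \times M \times K} \qquad (1)$$

HOSVD [20] was applied to $x_{jj'k}$, resulting in

$$x_{jj'k} = \sum_{\ell_1=1}^{M} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{K} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 j} u_{\ell_2 j'} u_{\ell_3 k} \qquad (2)$$

where $G \in \mathbb{R}^{M \times M \times K}$ is a core tensor that represents the weight of the product $u_{\ell_1 j} u_{\ell_2 j'} u_{\ell_3 k}$ towards $x_{jj'k}$. $u_{\ell_1 j}, u_{\ell_2 j'} \in \mathbb{R}^{M \times M}$ and $u_{\ell_3 k} \in \mathbb{R}^{K \times K}$ are singular value orthogonal matrices, and $u_{\ell_1 j} = u_{\ell_2 j'}$ when $\ell_1 = \ell_2$ and $j = j'$. After identifying the $u_{\ell_1 j}$ of interest and denoting the set of these $\ell_1$s as $\Omega$, we can derive the singular value vectors attributed to the $i_k$s by

$$u^{(k)}_{\ell_1 i_k} = \sum_{j=1}^{M} x_{i_k j k} u_{\ell_1 j} \qquad (3)$$

and attribute $P$-values to the $i_k$s, assuming that $u^{(k)}_{\ell_1 i_k}$ follows a Gaussian distribution (null hypothesis):

$$P_{i_k} = P_{\chi^2}\left[ > \sum_{\ell_1 \in \Omega} \left( \frac{u^{(k)}_{\ell_1 i_k}}{\sigma_{\ell_1}} \right)^2 \right] \qquad (4)$$

where $P_{\chi^2}[> x]$ is the cumulative $\chi^2$ distribution whose argument is larger than $x$ and $\sigma_{\ell_1}$ is the standard deviation. The $P_{i_k}$s were corrected using the BH criterion [20], and the $i_k$s with an adjusted $P_{i_k}$ of less than 0.01 were selected.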
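The procedure in this subsection (a subject-by-subject kernel tensor, HOSVD, projection onto features, χ2-based P-values, and BH correction) can be sketched with NumPy as below. This is a simplified illustration, not the authors' implementation: it assumes a single selected subject singular vector (so the χ2 statistic has one degree of freedom), uses the plain sample standard deviation for the scaling, runs on synthetic random data, and all function and variable names are hypothetical.

```python
import math

import numpy as np


def benjamini_hochberg(p):
    """Return BH-adjusted P-values in the original order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def ktd_feature_pvalues(omics, ell1):
    """omics: list of K arrays of shape (N_k, M), features x subjects.

    ell1 is the zero-based index of the chosen subject singular vector.
    Returns (U, adjusted), where U[:, l] is the l-th subject singular vector
    and adjusted[k] holds the BH-adjusted P-values for the k-th omics layer.
    """
    n_subjects = omics[0].shape[1]
    # Linear-kernel tensor of shape (M, M, K): inner products between subjects.
    kernel = np.stack([x.T @ x for x in omics], axis=2)
    # HOSVD, subject mode: left singular vectors of the mode-1 unfolding.
    unfolded = kernel.reshape(n_subjects, -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    u_subject = U[:, ell1]
    adjusted = []
    for x in omics:
        u_feature = x @ u_subject  # project features onto the subject vector
        stat = (u_feature / u_feature.std()) ** 2
        # Upper tail of a chi-squared distribution with 1 degree of freedom.
        p = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
        adjusted.append(benjamini_hochberg(p))
    return U, adjusted


# Synthetic stand-ins for expression, methylation, and 0/1/2 variant matrices.
rng = np.random.default_rng(0)
M = 20
omics = [
    rng.normal(size=(500, M)),
    rng.normal(size=(300, M)),
    rng.integers(0, 3, size=(1000, M)).astype(float),
]
U, adjusted = ktd_feature_pvalues(omics, ell1=1)
selected = [np.flatnonzero(a < 0.01) for a in adjusted]
```

Only the M × M × K kernel tensor is decomposed, so the cost of the SVD step does not grow with the number of features; the feature matrices are touched only by matrix products.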

Identification of genes associated with selected genomic regions

Genes included in the genomic regions selected by KTD-based unsupervised FE were identified using the biomaRt [21] package in R [22] for the hg19 genome.

Enrichment analysis

Enrichment analysis was performed using Enrichr software [23].

Transcription factor regulation analysis

Information on TF mutual regulation relations was retrieved from Regnetworkweb [24] and TRRUST2 [25].

Identification of TFBSs and genes associated with detected genetic variants

TFBSs and genes associated with genetic variation were identified using SNPnexus [26].

Results

Fig 2 presents a flowchart of the analyses.

Fig 2. Flowchart of the analyses for CD4T cells, neutrophils, and monocytes.

Gene expression and DNA methylation profiles were averaged over 25,000-nucleotide regions. Gene expression, DNA methylation, and genomic variants were multiplied by themselves to obtain square matrices that were bundled into a tensor; tensor decomposition was then applied. The obtained $u_{\ell_1 j}$ were compared with clinical information and used to compute $u^{(k)}_{\ell_1 i_k}$ to select regions (the “max correlated set” ($\ell_1 = 2$) and the “clinically correlated set” ($\ell_1 = 3$) were identified). For gene expression and DNA methylation, TFs that target genes included in the identified regions were selected and validated with enrichment analyses and comparisons with TRRUST2 and Regnetworkweb. The identified TFs were also compared with the TFBSs identified by the selected genomic variants.

https://doi.org/10.1371/journal.pone.0289029.g002

Identification of $u_{\ell_1 j}$s of interest

Since this study included only healthy individuals, we could not identify differentially expressed genes (DEG). Therefore, we employed a fully unsupervised strategy. Defining DEG is difficult using this strategy. Our criterion was as follows: seek $u_{\ell_1 j}$s that are common among distinct individual autosomes. If the 22 $u_{\ell_1 j}$s identified for the individual autosomes share the same dependence on subject $j$, this is unlikely to be accidental. There is some possibility that they reflect measurement bias, for example, if the total number of reads differs from subject to subject; however, this is very unlikely for the following reasons. First, as the present study was an integrated analysis of three omics measurements, the same patterns of subject measurement bias were unlikely to occur for the three omics datasets simultaneously because the experimental procedures differed from each other. Second, as the $u_{\ell_1 j}$ are orthogonal to each other for distinct $\ell_1$s, if more than two patterns of subject dependence are observed for more than one $u_{\ell_1 j}$, not all of them can be interpreted as measurement bias, which can result in only one unique pattern of subject dependence. Third, if subject-dependent patterns are caused by measurement bias, the genes selected based on these patterns may not be biologically reasonable; however, the biological significance of the selected genes can easily be validated by applying enrichment analysis.

Table 1 lists the average absolute mutual correlation coefficients between the patterns attributed to the 22 autosomes:

$$\frac{1}{231} \sum_{chr < chr'} \left| \rho\left( u^{chr}_{\ell_1 j}, u^{chr'}_{\ell_1' j} \right) \right| \qquad (5)$$

where $\rho(\cdot, \cdot)$ is the correlation coefficient between $u^{chr}_{\ell_1 j}$ and $u^{chr'}_{\ell_1' j}$, and $u^{chr}_{\ell_1 j}$ is the $u_{\ell_1 j}$ selected for the chr-th autosome (1 ≤ chr ≤ 22). These factors are mutually correlated. To further assess significance, we computed P-values for all 22 × 21/2 = 231 pairs and counted the number of pairs associated with significant correlations. Most pairs were significantly correlated (Table 2). Hereafter, this set of $u_{\ell_1 j}$ is denoted as the “max correlated set.” Next, we attempted to determine whether the $u_{\ell_1 j}$s in the “max correlated set” correlated with the clinical data (Table 3). Unfortunately, the $u_{\ell_1 j}$s with $\ell_1 = 2$, the $\ell_1$ selected most frequently for the “max correlated set” in Table 1, are not correlated with the clinical data. Thus, we decided to select the $u_{\ell_1 j}$s with $\ell_1 = 3$ for CD4+ T cells and neutrophils as additional singular value vectors of interest (those for monocytes were not selected because they did not correlate with the clinical data in Table 3). Table 1 also shows the mutual correlation coefficients between the patterns attributed to the 22 autosomes and the associated and corrected P-values for the $u_{\ell_1 j}$s with $\ell_1 = 3$. Although the correlations were weaker than those in the “max correlated set,” they were still significant (Table 1), as the majority of the 231 pairs were significantly correlated (Table 2). Thus, we decided to employ the $u_{\ell_1 j}$s with $\ell_1 = 3$ for the downstream analyses. Hereafter, these sets of $u_{\ell_1 j}$ with $\ell_1 = 3$ are denoted as “clinically correlated sets.”

Table 2. Number of significantly correlated pairs among all 231 pairs.

https://doi.org/10.1371/journal.pone.0289029.t002
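The average absolute mutual correlation over all autosome pairs can be sketched as follows. The input is assumed to be one subject-pattern vector per autosome with subjects in the same order; the function name `mean_abs_correlation` and the toy data are hypothetical. Absolute values are used because the sign of a singular vector is not determined by the decomposition. The significance testing of individual pairs is not reproduced here.

```python
from itertools import combinations

import numpy as np


def mean_abs_correlation(vectors):
    """Average |Pearson correlation| over all unordered pairs of vectors.

    Returns (mean absolute correlation, number of pairs).
    """
    pairs = list(combinations(range(len(vectors)), 2))
    rho = [abs(np.corrcoef(vectors[a], vectors[b])[0, 1]) for a, b in pairs]
    return float(np.mean(rho)), len(pairs)


# Toy input: 22 "autosome" vectors that share one pattern up to sign.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
vectors = [base if chrom % 2 == 0 else -base for chrom in range(22)]
mean_rho, n_pairs = mean_abs_correlation(vectors)
print(n_pairs)  # 231
```

With 22 autosomes the number of unordered pairs is 22 × 21/2 = 231, matching the count used in the text.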

Selection of genomic regions and variants, and their biological validation

Following the described procedures, genomic regions and variants were identified by KTD-based unsupervised FE, and the genes included in the genomic regions and the genes associated with the variants were identified using biomaRt and SNPnexus, respectively. Transcription factor-binding sites (TFBSs) associated with genomic variants were also identified using SNPnexus. After collecting the genes identified for the individual autosomes, Enrichr was used to identify TFs that targeted the genes included in the selected genomic regions and to validate their biological significance.

“Max correlated set”: CD4 T cells.

We identified 221 and 536 genomic regions for gene expression and DNA methylation, respectively, as well as 1,174,607 genomic variants that were presumed to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “CD4 T Cells” under the “Max correlated set” in Table 1. A total of 419 and 590 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. A total of 14,346 genes were associated with genomic variants. By uploading the 419 genes identified for gene expression to Enrichr, 26 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). To validate their biological significance, these 26 TFs were uploaded to Enrichr and found to form a biologically significant set (Table 5). Furthermore, 25 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 590 genes identified for DNA methylation to Enrichr (Table 6). These 25 TFs formed a biologically significant set (Table 7).

Table 4. TFs for genes identified by gene expression in “ENCODE and ChEA Consensus TFs from ChIP-X”.

https://doi.org/10.1371/journal.pone.0289029.t004

Table 5. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t005

Table 6. TFs for genes identified by DNA methylation in “ChEA 2016”.

https://doi.org/10.1371/journal.pone.0289029.t006

Table 7. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t007

“Max correlated set”: Neutrophils.

We identified 356 and 154 genomic regions for gene expression and DNA methylation, respectively, as well as 778,698 genomic variants that were presumed to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Neutrophils” under the “Max correlated set” in Table 1. A total of 490 and 500 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 15,356 genes were associated with genomic variants. By uploading the 490 genes identified for gene expression to Enrichr, 17 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 17 TFs formed a biologically significant set (Table 8). Furthermore, by uploading the 500 genes identified for DNA methylation, 33 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category (Table 6). These 33 TFs formed a biologically significant set (Table 9).

Table 8. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t008

Table 9. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t009

“Max correlated set”: Monocytes.

We identified 182 and 558 genomic regions for gene expression and DNA methylation, respectively, as well as 1,105,748 genomic variants that were presumed to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Monocytes” under the “Max correlated set” in Table 1. In total, 453 and 1,015 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 14,032 genes were associated with genomic variants. Twenty-four TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category by uploading the 453 genes identified for gene expression (Table 4). These 24 TFs formed a biologically significant set (Table 10). Furthermore, 30 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 1,015 genes identified for DNA methylation (Table 6). These 30 TFs formed a biologically significant set (Table 11).

Table 10. KEGG Human 2019 (for TFs listed in the “Monocytes” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t010

Table 11. KEGG Human 2019 (for TFs listed in the “Monocytes” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t011

“Clinically correlated set”: CD4 T cells.

We identified 425 and 281 genomic regions for gene expression and DNA methylation, respectively, as well as 1,073,649 genomic variants that were presumed to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “CD4 T cell” under the “Clinically correlated set” in Table 1. In total, 794 and 412 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 13,178 genes were associated with genomic variants. After uploading the 794 genes identified for gene expression, 36 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 36 TFs formed a biologically significant set (Table 12). Furthermore, 18 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 412 genes identified for DNA methylation to Enrichr (Table 6). These 18 TFs formed a biologically significant set (Table 13).

Table 12. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Clinically correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t012

Table 13. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Clinically correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t013

“Clinically correlated set”: Neutrophils.

We identified 380 and 541 genomic regions for gene expression and DNA methylation, respectively, as well as 63,894 genomic variants that were presumed to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Neutrophils” under the “Clinically correlated set” in Table 1. A total of 610 and 499 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 3,292 genes were associated with genomic variants. By uploading the 610 genes identified for gene expression to Enrichr, 26 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 26 TFs formed a biologically significant set (Table 14). Furthermore, by uploading the 499 genes identified for DNA methylation, 30 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category (Table 6). These 30 TFs formed a biologically significant set (Table 15).

Table 14. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Clinically correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t014

Table 15. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Clinically correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t015

Discussion

The selected genes were targeted by various TFs enriched in KEGG pathways; thus, the genes whose profiles coincided with the patient profiles expressed by the selected singular value vectors were biologically valid. The selected KEGG pathways were likely to express the biological properties of the participants’ blood cells. If the identified TFs regulate one another, the identified genes and variants are more likely to be functionally relevant. To determine whether this was the case, we uploaded the selected TFs to two databases of regulatory relationships between TFs: Regnetworkweb and TRRUST2. Regnetworkweb considers only direct regulatory relationships between TFs, whereas TRRUST2 also considers indirect regulatory relationships, for example, two TFs targeting the same genes (Figs 3–12). The identified TFs are clearly highly interconnected. Thus, in terms of regulatory relationships, the identified TFs are reasonable.

Fig 3. Regulatory network between TFs in the “CD4 T cells” category under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g003

Fig 4. Regulatory network between TFs in the “CD4 T cells” category under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g004

Fig 5. Regulatory network between TFs in “Neutrophils” under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g005

Fig 6. Regulatory network between TFs in “Neutrophils” under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g006

Fig 7. Regulatory network between TFs in “Monocytes” under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g007

Fig 8. Regulatory network between TFs in “Monocytes” under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g008

Fig 9. Regulatory network between TFs in “CD4 T cells” under the “Clinically correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g009

Fig 10. Regulatory network between TFs in “CD4 T cells” under the “Clinically correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g010

Fig 11. Regulatory network between TFs in “Neutrophils” under the “Clinically correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g011

Fig 12. Regulatory network between TFs in “Neutrophils” under the “Clinically correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g012

To validate the overlap between the TFs that target genes identified based on gene expression or DNA methylation and those identified based on TFBSs, the total number of human TFs must be known; we assumed that it is approximately 2000 [27]. Table 16 shows the results of Fisher’s exact test. The TFs identified for gene expression significantly overlapped with the TFBSs associated with genomic variants, whereas no significant overlaps were found between the TFBSs identified using genomic variants and the TFs identified for methylation (not shown here).

Table 16. Fisher’s exact tests between TFs in Table 4 and those identified through the TFBSs of genomic variants detected by KTD-based unsupervised FE.

https://doi.org/10.1371/journal.pone.0289029.t016
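The overlap test can be illustrated with a hypergeometric tail computed from the standard library. This sketch assumes a one-sided test (enrichment of the overlap); the source does not state whether a one- or two-sided test was used. The function name `fisher_exact_greater` and the example counts other than the universe size of 2000 are hypothetical.

```python
from math import comb


def fisher_exact_greater(overlap, n_a, n_b, universe):
    """One-sided Fisher's exact test for the overlap of two sets.

    Probability of observing at least `overlap` shared items when sets of
    sizes n_a and n_b are drawn from `universe` items.
    """
    upper = min(n_a, n_b)
    total = comb(universe, n_b)
    tail = sum(
        comb(n_a, k) * comb(universe - n_a, n_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 26 TFs from enrichment, 100 TFs from TFBSs, 10 shared,
# with the universe of human TFs taken as about 2000 as assumed in the text.
p_value = fisher_exact_greater(overlap=10, n_a=26, n_b=100, universe=2000)
print(p_value)
```

Because the universe size enters the calculation directly, the resulting P-value depends on the assumed total number of human TFs.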

One might wonder why we did not compare the performance of our method with that of existing methods. To the best of our knowledge, no other methods are comparable to ours. First, our analysis of associations between gene expression, methylation, and genomic variants is free from location restrictions; the method can detect any kind of association between them, independent of their location along the genome. For example, we can identify interactions between genes and genomic variants that are distant from each other. This is because we can derive the singular value vectors attributed to the subjects, $u_{\ell_1 j}$, at the very beginning of the data analysis flow, just after applying TD to $x_{jj'k}$. Genomic regions and/or variants were then selected based on the singular value vectors attributed to the genomic regions or variants $i_k$. Second, the application of TD to $x_{jj'k}$ requires a very small amount of computational resources, as the size of $x_{jj'k}$ is determined by the number of subjects and the number of omics datasets and not by the number of features $N_k$. To our knowledge, no other method can select genomic regions and variants using such a small amount of computational resources. In particular, treating genomic variants is difficult. For gene expression and methylation, the values can be averaged within individual genomic regions, resulting in a reduced dimension of $i_k$, that is, $N_k$. Nevertheless, this cannot be performed for genomic variants because the integers (0, 1, and 2) derived from the genomic variants are arbitrary. Averaging distinct integers attributed to individual genomic variants can destroy the meaning of these integers. Despite this, our method is independent of the size of $N_k$ and can be applied to genomic variants as is. To the best of our knowledge, no other methods can perform this task; thus, we could not compare the performance of our method with that of any other method.

Our method does not require prior knowledge of the subjects. The singular value vectors attributed to the subjects, $u_{\ell_1 j}$, can be generated by applying TD to $x_{jj'k}$, which does not require any additional information about the subjects. The selection of the $u_{\ell_1 j}$ used to select the $i_k$s was based on the coincidence between those computed for the individual autosomes. Therefore, genomic regions and variants can be selected in a fully unsupervised manner. Nevertheless, the selected genes were significantly targeted by multiple TFs that were enriched in KEGG pathway terms.

Several biological insights were obtained from this population-based study. One possible application to clinical studies is to compare the outcomes of the present study with those of other clinical studies. Generally, both population- and clinical-based studies have their own biases, and by comparing their outcomes with each other, we can validate their outcomes, which is impossible when only individual outcomes are present.

However, this method has several limitations. First, it is applicable only to multiomics datasets that share samples. In addition, because this is an unsupervised method, if there are no significant results in the downstream analyses, we have no way to improve the results.

References

  1. Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology. 2011;12(1). pmid:21251332
  2. Seal DB, Das V, Goswami S, De RK. Estimating gene expression from DNA methylation and copy number variation: A deep learning regression model for multi-omics integration. Genomics. 2020;112(4):2833–2841. pmid:32234433
  3. Wang H, Lou D, Wang Z. Crosstalk of Genetic Variants, Allele-Specific DNA Methylation, and Environmental Factors for Complex Disease Risk. Frontiers in Genetics. 2019;9:695. pmid:30687383
  4. Shi X, Radhakrishnan S, Wen J, Chen JY, Chen J, Lam BA, et al. Association of CNVs with methylation variation. npj Genomic Medicine. 2020;5(1). pmid:33062306
  5. Roudbar MA, Mohammadabadi MR, Mehrgardi AA, Abdollahi-Arpanahi R, Momen M, Morota G, et al. Integration of single nucleotide variants and whole-genome DNA methylation profiles for classification of rheumatoid arthritis cases from controls. Heredity. 2020;124(5):658–674.
  6. Lea AJ, Vockley CM, Johnston RA, Del Carpio CA, Barreiro LB, Reddy TE, et al. Genome-wide quantification of the effects of DNA methylation on human gene regulation. eLife. 2018;7:e37513. pmid:30575519
  7. Natri HM, Bobowik KS, Kusuma P, Crenna Darusallam C, Jacobs GS, Hudjashov G, et al. Genome-wide DNA methylation and gene expression patterns reflect genetic ancestry and environmental differences across the Indonesian archipelago. PLOS Genetics. 2020;16(5):1–21. pmid:32453742
  8. Blake LE, Roux J, Hernando-Herraez I, Banovich NE, Perez RG, Hsiao CJ, et al. A comparison of gene expression and DNA methylation patterns across tissues and species. Genome Research. 2020;30(2):250–262. pmid:31953346
  9. Alakärppä E, Salo HM, Valledor L, Cañal MJ, Häggman H, Vuosku J. Natural variation of DNA methylation and gene expression may determine local adaptations of Scots pine populations. Journal of Experimental Botany. 2018;69(21):5293–5305. pmid:30113688
  10. Franke L, Jansen RC. eQTL Analysis in Humans. In: Methods in Molecular Biology. Humana Press; 2009. p. 311–328. Available from: https://doi.org/10.1007/978-1-60761-247-6_17.
  11. Taguchi YH, Turki T. Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis. BMC Medical Genomics. 2022;15(1). pmid:35209912
  12. Komaki S, Shiwa Y, Furukawa R, Hachiya T, Ohmomo H, Otomo R, et al. iMETHYL: an integrative database of human DNA methylation, gene expression, and genomic variation. Human Genome Variation. 2018;5(1). pmid:29619235
  13. Available from: http://imethyl.iwate-megabank.org/.
  14. Analysis pipelines for the GTEx Consortium and TOPMed. Available from: https://github.com/broadinstitute/gtex-pipeline.
  15. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012;29(1):15–21. pmid:23104886
  16. Hachiya T, Furukawa R, Shiwa Y, Ohmomo H, Ono K, Katsuoka F, et al. Genome-wide identification of inter-individually variable DNA methylation sites improves the efficacy of epigenetic association studies. npj Genomic Medicine. 2017;2(1). pmid:29263827
  17. Tadaka S, Katsuoka F, Ueki M, Kojima K, Makino S, et al. 3.5KJPNv2: an allele frequency panel of 3552 Japanese individuals including the X chromosome. Human Genome Variation. 2019;6(1). pmid:31240104
  18. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM; 2013.
  19. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2018.
  20. Taguchi YH. Unsupervised Feature Extraction Applied to Bioinformatics. Springer International Publishing; 2020. Available from: https://doi.org/10.1007/978-3-030-22456-1.
  21. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols. 2009;4(8):1184–1191. pmid:19617889
  22. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
  23. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. 2016;44(W1):W90–W97. pmid:27141961
  24. Liu ZP, Wu C, Miao H, Wu H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database. 2015;2015. pmid:26424082
  25. Han H, Cho JW, Lee S, Yun A, Kim H, Bae D, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research. 2017;46(D1):D380–D386.
  26. Oscanoa J, Sivapalan L, Gadaleta E, Dayem Ullah AZ, Lemoine NR, Chelala C. SNPnexus: a web server for functional annotation of human genome sequence variation (2020 update). Nucleic Acids Research. 2020;48(W1):W185–W192. pmid:32496546
  27. Brivanlou AH, Darnell JE. Signal Transduction and the Control of Gene Expression. Science. 2002;295(5556):813–818. pmid:11823631