Integrated analysis of human DNA methylation, gene expression, and genomic variation in iMETHYL database using kernel tensor decomposition-based unsupervised feature extraction

Y-h. Taguchi; Shohei Komaki; Yoichi Sutoh; Hideki Ohmomo; Yayoi Otsuka-Yamasaki; Atsushi Shimizu

doi:10.1371/journal.pone.0289029

Abstract

Integrating gene expression, DNA methylation, and genomic variants simultaneously, without requiring location coincidence (i.e., irrespective of their distance from each other) and without relying on pairwise coincidence (i.e., by directly identifying triplets of gene expression, DNA methylation, and genomic variants rather than combining pairwise associations), is difficult. In this study, we integrated gene expression, DNA methylation, and genomic variants from the iMETHYL database using the recently proposed kernel tensor decomposition-based unsupervised feature extraction method with limited computational resources (i.e., short CPU time and small memory requirements). Our method does not require prior knowledge of the subjects because it relies on fully unsupervised tensor decomposition. The selected genes and genomic variants were significantly targeted by transcription factors that were enriched in KEGG pathway terms and that were connected to one another within a regulatory network. The proposed method is promising for integrated analyses of gene expression, methylation, and genomic variants with limited computational resources.

Citation: Taguchi Y-h, Komaki S, Sutoh Y, Ohmomo H, Otsuka-Yamasaki Y, Shimizu A (2023) Integrated analysis of human DNA methylation, gene expression, and genomic variation in iMETHYL database using kernel tensor decomposition-based unsupervised feature extraction. PLoS ONE 18(8): e0289029. https://doi.org/10.1371/journal.pone.0289029

Editor: Turki Talal Turki, King Abdulaziz University, SAUDI ARABIA

Received: December 21, 2022; Accepted: July 7, 2023; Published: August 9, 2023

Copyright: © 2023 Taguchi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Summary statistics of DNA methylation, gene expression, and genomic variation are available from the NBDC database (https://humandbs.biosciencedbc.jp/en/hum0056-v1). Individual-level data cannot be made publicly available to protect the participants’ privacy but is available upon approval of an application to the Tohoku Medical Megabank Project (https://www.megabank.tohoku.ac.jp/english/; http://iwate-megabank.org/en/).

Funding: This study was partially funded by the Tohoku Medical Megabank project, supported by the Ministry of Education, Culture, Sports, Sciences, and Technology of the Japanese government and the Japan Agency for Medical Research and Development. This study was also supported by KAKENHI, [grant numbers 19H05270, 20H04848, and 20K12067] to YHT. The super computer resource (powered by AMED research grant JP20km0405001) was provided by Tohoku Medical Megabank Organization, Tohoku University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The integrated analysis of multiomics datasets has always been difficult; in particular, integrating gene expression, DNA methylation, and genetic variants has rarely been successful [1, 2]; in contrast, many studies integrate two of these three, that is, DNA methylation and genomic variants [3–5], gene expression and DNA methylation [6–9], and gene expression and genomic variants [10]. Although Seal et al. [2] successfully predicted gene expression from copy number variants (CNV) and DNA methylation, they did not discuss the relationship between CNV and DNA methylation. Therefore, they did not conduct a truly integrated analysis. Bell et al. [1] examined DNA methylation as a function of genetic and gene expression variation but did not directly investigate the relationship between gene expression and genetic variants; therefore, it was not a true integrated analysis.

In this study, we applied a recently proposed method [11] to the integrated analysis of gene expression, genetic variants, and DNA methylation using data retrieved from the iMETHYL database [12, 13], without assuming any causal relationship between them, within a purely data-driven framework. Gene expression, methylation, and genetic variation shared subject-dependent patterns and were regulated by transcription factors. Enrichment analyses based on the genes targeted by these transcription factors indicated associations with various biological functions.

Materials and methods

Data set

The data set comprised gene expression, DNA methylation, and genomic variation profiles obtained from the same subjects for each cell type (CD4-positive T cells: 99 subjects; monocytes: 99 subjects; neutrophils: 94 subjects; 194 unique subjects in total; see the Venn diagram in Fig 1). The 194 subjects for whom all three measurements (i.e., gene expression, DNA methylation, and genomic variation) were available in at least one of the three cell types were included in the analysis. The dataset analyzed in this study was obtained from the iMETHYL database after receiving approval from the Medical Ethics Committee of Iwate Medical University (approval no. HGH29-32) and the Ethics Committee of Chuo University (2019-6 and 2021-072).

Fig 1. Venn diagram of subjects in CD4 T cells, monocytes, and neutrophils.

https://doi.org/10.1371/journal.pone.0289029.g001

Preprocessing

Fastq files obtained from RNA-seq were processed following the GTEx pipeline V8 [14] with slight modifications. Briefly, sequence reads were aligned to the GRCh37 human reference genome using STAR v2.5.0 [15], and bam files were generated.

Sequence reads obtained from whole-genome bisulfite sequencing were aligned using NovoAlign v3.02.08 (Novocraft Technologies, Sdn. Bhd., Selangor, Malaysia). The number of converted and unconverted cytosines mapped to each CpG was counted using NovoMethyl v3.02.08 (Novocraft Technologies), and the proportion of unconverted cytosines was calculated as the DNA methylation level (%) [16].

Whole-genome sequence data were obtained from the Tohoku Medical Megabank Project [17], in which sequence reads were mapped onto the GRCh37 human reference genome using BWA-MEM [18] and variant calls were carried out using GATK v3.7 [19]. The resultant VCF files were further converted into the 012 format, where numeric variables ranging from 0 to 2 represent the number of non-reference alleles.
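As an illustration of the 012 encoding, the following minimal sketch counts the non-reference alleles in a VCF-style diploid genotype string. The function name `gt_to_012` and the treatment of missing calls (returned as `None`) are assumptions made for this example; they are not taken from the authors' pipeline.

```python
import re

def gt_to_012(gt):
    """Count non-reference alleles in a diploid VCF GT string such as '0/1'.

    Missing calls ('./.') are returned as None; this handling is an
    assumption of the sketch, not something specified in the text.
    """
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or "." in alleles:
        return None
    # Any allele index other than 0 is a non-reference allele.
    return sum(1 for allele in alleles if allele != "0")

genotypes = ["0/0", "0/1", "1|1", "./."]
encoded = [gt_to_012(g) for g in genotypes]
print(encoded)  # [0, 1, 2, None]
```

Both unphased (`/`) and phased (`|`) genotypes map to the same count, since only the number of non-reference alleles is retained.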

For the gene expression profiles, the bam files were converted into bed files using the bamtobed command. For the gene expression and DNA methylation profiles, the bed files were aggregated (summed or averaged) over consecutive 25,000-nucleotide intervals, separately for each of the 22 autosomes. Hereafter, these intervals are denoted as “genomic regions.” Genetic variants were converted to numeric values (0–2) representing the number of non-reference alleles.
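The aggregation into genomic regions can be sketched as follows. This is a minimal illustration that assumes each record is a (chromosome, 0-based position, value) tuple; the function name `bin_values` and the toy values are hypothetical, and the text does not specify which profile was summed and which was averaged.

```python
from collections import defaultdict

BIN_SIZE = 25_000  # interval length in nucleotides, as stated in the text

def bin_values(records, reducer="mean"):
    """Aggregate (chrom, position, value) records into fixed-width regions.

    Returns a dict mapping (chromosome, bin index) to the summed or
    averaged value of all records falling in that interval.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, value in records:
        key = (chrom, pos // BIN_SIZE)
        totals[key] += value
        counts[key] += 1
    if reducer == "sum":
        return dict(totals)
    if reducer == "mean":
        return {key: totals[key] / counts[key] for key in totals}
    raise ValueError("reducer must be 'sum' or 'mean'")

# Toy methylation levels (%) at three CpG positions on chromosome 1.
cpgs = [("chr1", 100, 80.0), ("chr1", 24_999, 60.0), ("chr1", 25_000, 10.0)]
regions = bin_values(cpgs, reducer="mean")
print(regions)  # {('chr1', 0): 70.0, ('chr1', 1): 10.0}
```

Because the bin key includes the chromosome, each autosome is binned separately, matching the per-autosome treatment described above.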

Tensor decomposition-based unsupervised feature extraction

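We applied kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE), optimized for multiomics data integration [11], to the dataset. Suppose that $x_{i_k j k} \in \mathbb{R}^{N_k \times M \times K}$ represents the value of the $i_k$th component of the $k$th omics dataset for the $j$th subject. From these, we generated

$$x_{jj'k} = \sum_{i_k=1}^{N_k} x_{i_k j k}\, x_{i_k j' k} \in \mathbb{R}^{M \times M \times K}. \tag{1}$$

HOSVD [20] was applied to $x_{jj'k}$, resulting in

$$x_{jj'k} = \sum_{\ell_1=1}^{M} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{K} G(\ell_1 \ell_2 \ell_3)\, u_{\ell_1 j}\, u_{\ell_2 j'}\, u_{\ell_3 k}, \tag{2}$$

where $G \in \mathbb{R}^{M \times M \times K}$ is a core tensor that represents the weight of the product $u_{\ell_1 j} u_{\ell_2 j'} u_{\ell_3 k}$ towards $x_{jj'k}$, and $u_{\ell_1 j}, u_{\ell_2 j'} \in \mathbb{R}^{M \times M}$ and $u_{\ell_3 k} \in \mathbb{R}^{K \times K}$ are singular value orthogonal matrices; $u_{\ell_1 j} = u_{\ell_2 j'}$ when $\ell_1 = \ell_2$ and $j = j'$. After identifying the $u_{\ell_1 j}$ of interest and denoting the set of these $\ell_1$s as $\Omega_{\ell_1}$, we can derive the singular value vectors attributed to the $i_k$s by

$$u_{\ell_1 i_k} = \sum_{j=1}^{M} u_{\ell_1 j}\, x_{i_k j k} \tag{3}$$

and attribute P-values to $i_k$, assuming that $u_{\ell_1 i_k}$ follows a Gaussian distribution (null hypothesis):

$$P_{i_k} = P_{\chi^2}\left[ > \sum_{\ell_1 \in \Omega_{\ell_1}} \left( \frac{u_{\ell_1 i_k}}{\sigma_{\ell_1}} \right)^2 \right], \tag{4}$$

where $P_{\chi^2}[> x]$ is the cumulative $\chi^2$ distribution for arguments larger than $x$ and $\sigma_{\ell_1}$ is the standard deviation. The $P_{i_k}$s were corrected using the BH criterion [20], and the $i_k$s with an adjusted $P_{i_k}$ of less than 0.01 were selected.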
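The procedure in Eqs (1) to (4) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' implementation: all function names are hypothetical, a single $\ell_1$ is used (so the $\chi^2$ distribution has one degree of freedom), the plain sample standard deviation stands in for $\sigma_{\ell_1}$, and the first singular vector is chosen simply because the planted pattern dominates the toy data, whereas the study selects $\ell_1$ by agreement across autosomes.

```python
import math
import numpy as np

def kernel_tensor(omics):
    """Eq. (1): stack the linear kernels X_k^T X_k of K matrices of shape (N_k, M)."""
    return np.stack([x.T @ x for x in omics], axis=2)  # shape (M, M, K)

def subject_vectors(tensor):
    """HOSVD mode-1 factors: left singular vectors of the mode-1 unfolding."""
    unfolded = tensor.reshape(tensor.shape[0], -1)  # shape (M, M*K)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u  # column l holds u_{l j} over subjects j

def feature_p_values(x, u_l):
    """Eqs. (3)-(4) for one selected l_1 (chi-squared with 1 d.o.f.)."""
    u_feat = x @ u_l                        # sum_j u_{l1 j} x_{i_k j k}
    scaled = (u_feat / u_feat.std()) ** 2   # sample SD: a simplification
    # Upper tail of chi-squared(1): P[> s] = erfc(sqrt(s / 2)).
    return np.array([math.erfc(math.sqrt(s / 2.0)) for s in scaled])

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest rank downwards.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Synthetic data: 20 subjects, three "omics" layers of different sizes.
rng = np.random.default_rng(0)
M, sizes = 20, (300, 400, 500)
pattern = rng.standard_normal(M)
omics = []
for n in sizes:
    x = rng.standard_normal((n, M))
    x[:10] += 5.0 * pattern  # first 10 features share the subject pattern
    omics.append(x)

u = subject_vectors(kernel_tensor(omics))
adj = benjamini_hochberg(feature_p_values(omics[0], u[:, 0]))
selected = np.flatnonzero(adj < 0.01)
print(selected)
```

Note that the decomposition operates only on the M × M × K kernel tensor; the N_k-dimensional data are touched once to build the kernels and once to project the chosen subject vector back onto the features.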

Identification of genes associated with selected genomic regions

Genes included in the genomic regions selected by KTD-based unsupervised FE were identified using the biomaRt [21] package in R [22] for the hg19 genome.

Enrichment analysis

Enrichment analysis was performed using Enrichr software [23].

Transcription factor regulation analysis

Information on TF mutual regulation relations was retrieved from Regnetworkweb [24] and TRRUST2 [25].

Identification of TFBSs and genes associated with detected genetic variants

TFBSs and genes associated with genetic variation were identified using SNPnexus [26].

Results

Fig 2 presents a flowchart of the analyses.

Fig 2. Flowchart of the analyses for CD4 T cells, neutrophils, and monocytes.

Gene expression and DNA methylation profiles were averaged over 25,000-nucleotide regions. The gene expression, DNA methylation, and genomic variant matrices were multiplied by themselves to obtain square matrices that were bundled into a tensor; tensor decomposition was then applied. The obtained $u_{\ell_1 j}$ were compared with clinical information and used to compute $P_{i_k}$ to select regions (the “max correlated set” ($\ell_1 = 2$) and the “clinically correlated set” ($\ell_1 = 3$) were identified). For gene expression and DNA methylation, TFs that target genes included in the identified regions were selected and validated with enrichment analyses and comparisons with TRRUST2 and Regnetwork. The identified TFs were also compared with the TFBSs identified by the selected genomic variants.

https://doi.org/10.1371/journal.pone.0289029.g002

Identification of $u_{\ell_1 j}$s of interest

Since this study included only healthy individuals, we could not identify differentially expressed genes (DEGs) by comparing groups. Therefore, we employed a fully unsupervised strategy, under which defining DEGs is difficult. Our criterion was as follows: seek $u_{\ell_1 j}$s that are common among distinct individual autosomes. If the 22 $u_{\ell_1 j}$s identified for individual autosomes share the same dependence on subject $j$, this is unlikely to be accidental. There is some possibility that such shared patterns reflect measurement bias, for example, if the total number of reads differs from subject to subject; however, this is very unlikely for the following reasons. First, as the present study was an integrated analysis of three omics measurements, the same pattern of subject-dependent measurement bias was unlikely to occur in the three omics datasets simultaneously because the experimental procedures differed from each other. Second, as the $u_{\ell_1 j}$ are orthogonal to each other for distinct $\ell_1$s, if more than two patterns of subject dependence are observed for more than one $\ell_1$, none of them can be interpreted as measurement bias, which can result in only one unique pattern of subject dependence. Third, if subject-dependent patterns were caused by measurement bias, the genes selected based on these patterns might not be biologically reasonable; however, the biological significance of the selected genes is easily validated by enrichment analysis.

Table 1 lists the average absolute mutual correlation coefficients between the patterns attributed to the 22 autosomes:

$$\frac{2}{22 \times 21} \sum_{chr < chr'} \left| \rho_{chr, chr'} \right|, \tag{5}$$

where $\rho_{chr, chr'}$ is the correlation coefficient between $u^{chr}_{\ell_1 j}$ and $u^{chr'}_{\ell_1 j}$, and $\ell_1$ is selected for the $chr$th autosome ($1 \le chr \le 22$). These vectors are mutually correlated. To further validate the significance of the number of pairs among the total 22 × 21/2 = 231 pairs, we computed P-values and counted the number of pairs associated with significant correlations. Most pairs were significantly correlated (Table 2). Hereafter, this set of $u_{\ell_1 j}$ is denoted as the “max correlated set.” Next, we examined whether the $u_{\ell_1 j}$s in the “max correlated set” correlated with the clinical data (Table 3). Unfortunately, the $u_{\ell_1 j}$s with $\ell_1 = 2$, the $\ell_1$ selected most frequently in Table 1 for the “max correlated set,” were not correlated with the clinical data. Thus, we decided to select the $u_{\ell_1 j}$s with $\ell_1 = 3$ for CD4 T cells and neutrophils as additional singular value vectors of interest (those for monocytes were not selected because they did not correlate with the clinical data in Table 3). Table 1 also shows the mutual correlation coefficients between the patterns attributed to the 22 autosomes and the associated and corrected P-values for the $u_{\ell_1 j}$s with $\ell_1 = 3$. Although the correlations were weaker than those in the “max correlated set,” they were still largely significant (Table 1), as the majority of the 231 pairs were significantly correlated (Table 2). Thus, we decided to employ the $u_{\ell_1 j}$s with $\ell_1 = 3$ for the downstream analyses. Hereafter, these sets of $u_{\ell_1 j}$ with $\ell_1 = 3$ are denoted as “clinically correlated sets.”
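The averaged quantity in Eq (5) can be computed as in the following sketch, which assumes Pearson correlation (the text says only “correlation coefficient”) and uses three toy vectors in place of the 22 per-autosome singular value vectors. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def mean_abs_correlation(vectors):
    """Average |r| over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)

# Three toy subject vectors standing in for the 22 per-autosome vectors.
toy = [[1.0, 2.0, 3.0, 4.0], [-2.0, -4.0, -6.0, -8.0], [1.0, 3.0, 2.0, 4.0]]
print(round(mean_abs_correlation(toy), 3))          # 0.867
print(len(list(combinations(range(22), 2))))        # 231
```

Taking the absolute value matters because the sign of a singular value vector is arbitrary: the first two toy vectors are perfectly anti-correlated yet describe the same subject pattern.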

Table 1. Averaged mutual correlations between $u_{\ell_1 j}$s.

https://doi.org/10.1371/journal.pone.0289029.t001

Table 2. Number of significantly correlated pairs among all 231 pairs.

https://doi.org/10.1371/journal.pone.0289029.t002

Table 3. Correlation between clinical data and $u_{\ell_1 j}$.

https://doi.org/10.1371/journal.pone.0289029.t003

Selection of genomic regions and variants, and their biological validation

Following the procedures described above, genomic regions and variants were selected by applying KTD-based unsupervised FE, and the genes included in the selected genomic regions and the genes associated with the selected variants were identified using biomaRt and SNPnexus, respectively. Transcription factor-binding sites (TFBSs) associated with the genomic variants were also identified using SNPnexus. After collecting the genes identified for individual autosomes, Enrichr was used to identify TFs that target the genes included in the genomic regions and to validate the biological significance of these TFs.

“Max correlated set”: CD4 T cells.

We identified 221 and 536 genomic regions for gene expression and DNA methylation, respectively, as well as 1,174,607 genomic variants that were considered to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “CD4 T cells” under the “Max correlated set” in Table 1. A total of 419 and 590 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. A total of 14,346 genes were associated with genomic variants. By uploading the 419 genes identified for gene expression to Enrichr, 26 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). To validate their biological significance, these 26 TFs were uploaded to Enrichr and found to form a biologically significant set (Table 5). Furthermore, 25 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 590 genes identified for DNA methylation to Enrichr (Table 6). These 25 TFs formed a biologically significant set (Table 7).

Table 4. TFs for genes identified by gene expression in “ENCODE and ChEA Consensus TFs from ChIP-X”.

https://doi.org/10.1371/journal.pone.0289029.t004

Table 5. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t005

Table 6. TFs for genes identified by DNA methylation in “ChEA 2016”.

https://doi.org/10.1371/journal.pone.0289029.t006

Table 7. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t007

“Max correlated set”: Neutrophils.

We identified 356 and 154 genomic regions for gene expression and DNA methylation, respectively, as well as 778,698 genomic variants that were considered to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Neutrophils” under the “Max correlated set” in Table 1. A total of 490 and 500 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 15,356 genes were associated with genomic variants. By uploading the 490 genes identified for gene expression to Enrichr, 17 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 17 TFs formed a biologically significant set (Table 8). Furthermore, by uploading the 500 genes identified for DNA methylation, 33 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category (Table 6). These 33 TFs formed a biologically significant set (Table 9).

Table 8. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t008

Table 9. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t009

“Max correlated set”: Monocytes.

We identified 182 and 558 genomic regions for gene expression and DNA methylation, respectively, as well as 1,105,748 genomic variants that were considered to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Monocytes” under the “Max correlated set” in Table 1. In total, 453 and 1,015 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 14,032 genes were associated with genomic variants. Twenty-four TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category by uploading the 453 genes identified for gene expression (Table 4). These 24 TFs formed a biologically significant set (Table 10). Furthermore, 30 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 1,015 genes identified for DNA methylation (Table 6). These 30 TFs formed a biologically significant set (Table 11).

Table 10. KEGG Human 2019 (for TFs listed in the “Monocytes” category under the “Max correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t010

Table 11. KEGG Human 2019 (for TFs listed in the “Monocytes” category under the “Max correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t011

“Clinically correlated set”: CD4 T cells.

We identified 425 and 281 genomic regions for gene expression and DNA methylation, respectively, as well as 1,073,649 genomic variants that were considered to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “CD4 T cells” under the “Clinically correlated set” in Table 1. In total, 794 and 412 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 13,178 genes were associated with genomic variants. After uploading the 794 genes identified for gene expression, 36 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 36 TFs formed a biologically significant set (Table 12). Furthermore, 18 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category by uploading the 412 genes identified for DNA methylation to Enrichr (Table 6). These 18 TFs formed a biologically significant set (Table 13).

Table 12. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Clinically correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t012

Table 13. KEGG Human 2019 (for TFs listed in the “CD4 T cells” category under the “Clinically correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t013

“Clinically correlated set”: Neutrophils.

We identified 380 and 541 genomic regions for gene expression and DNA methylation, respectively, as well as 63,894 genomic variants that were considered to coincide with the subject profiles represented by the $u_{\ell_1 j}$ listed in the column “Neutrophils” under the “Clinically correlated set” in Table 1. A total of 610 and 499 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 3,292 genes were associated with genomic variants. By uploading the 610 genes identified for gene expression to Enrichr, 26 TFs with adjusted P-values less than 0.05 were identified in the “ChEA & ENCODE consensus” category (Table 4). These 26 TFs formed a biologically significant set (Table 14). Furthermore, by uploading the 499 genes identified for DNA methylation, 30 TFs with adjusted P-values less than 0.05 were identified in the “ChEA 2016” category (Table 6). These 30 TFs formed a biologically significant set (Table 15).

Table 14. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Clinically correlated set” of Table 4).

https://doi.org/10.1371/journal.pone.0289029.t014

Table 15. KEGG Human 2019 (for TFs listed in the “Neutrophils” category under the “Clinically correlated set” of Table 6).

https://doi.org/10.1371/journal.pone.0289029.t015

Discussion

The selected genes were targeted by various TFs that were enriched in KEGG pathways; thus, the genes whose profiles coincided with the subject profiles expressed by the selected singular value vectors were biologically plausible. The selected KEGG pathways are likely to reflect the biological properties of the participants’ blood cells. If the identified TFs regulate one another, the identified genes and variants are more likely to be meaningful. To determine whether the identified TFs were intraregulated, we uploaded the selected TFs to two databases that record regulatory relationships between TFs: Regnetworkweb and TRRUST2 (Figs 3–12). Regnetworkweb considers only direct regulatory relationships between TFs, whereas TRRUST2 considers indirect regulatory relationships, for example, two TFs targeting the same genes. The TFs are clearly highly interconnected. Thus, in terms of regulatory relationships, the identified TFs are reasonable.

Fig 3. Regulatory network between TFs in the “CD4 T cells” category under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g003

Fig 4. Regulatory network between TFs in the “CD4 T cells” category under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g004

Fig 5. Regulatory network between TFs in the “Neutrophils” category under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g005

Fig 6. Regulatory network between TFs in the “Neutrophils” category under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g006

Fig 7. Regulatory network between TFs in the “Monocytes” category under the “Max correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g007

Fig 8. Regulatory network between TFs in the “Monocytes” category under the “Max correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g008

Fig 9. Regulatory network between TFs in the “CD4 T cells” category under the “Clinically correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g009

Fig 10. Regulatory network between TFs in the “CD4 T cells” category under the “Clinically correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g010

Fig 11. Regulatory network between TFs in the “Neutrophils” category under the “Clinically correlated set” in Table 4.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 4. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g011

Fig 12. Regulatory network between TFs in the “Neutrophils” category under the “Clinically correlated set” in Table 6.

Upper: Regnetworkweb, lower: TRRUST2. Blue genes in Regnetworkweb and yellow genes in TRRUST2 are TFs in Table 6. Blue genes in TRRUST2 are associated with these.

https://doi.org/10.1371/journal.pone.0289029.g012

To evaluate the overlap between the TFs that target genes identified based on gene expression or DNA methylation and the TFs identified through TFBSs, the total number of human TFs must be specified; we assumed it to be approximately 2,000 [27]. Table 16 shows the results of Fisher’s exact test. The TFs identified for gene expression significantly overlapped with those associated with the TFBSs of genomic variants, whereas no significant overlaps were found between the TFs identified through the TFBSs of genomic variants and the TFs identified for methylation (not shown here).
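The overlap test can be illustrated with a minimal sketch of a one-sided Fisher’s exact test, computed as a hypergeometric upper tail over a universe of 2,000 TFs. The function name `fisher_overlap_p` is hypothetical, the one-sided alternative is an assumption (the text does not state which alternative was used), and the set size of 100 and overlap of 5 are made-up illustrative inputs, not values from Table 16.

```python
from math import comb

def fisher_overlap_p(universe, size_a, size_b, overlap):
    """One-sided Fisher's exact test for the overlap of two sets.

    Returns P(X >= overlap), where X is the number of elements shared by a
    fixed set of size_a and a random set of size_b drawn from the universe.
    """
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, size_b)

# 26 TFs (as for CD4 T cells, gene expression) against a hypothetical set of
# 100 TFBS-derived TFs sharing 5 members, in a universe of about 2,000 TFs.
p_value = fisher_overlap_p(universe=2000, size_a=26, size_b=100, overlap=5)
print(p_value)
```

The result depends directly on the assumed universe size, which is why the total number of human TFs has to be fixed before the test can be run.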

Table 16. Fisher’s exact tests between TFs in Table 4 and those identified through the TFBSs of genomic variants detected by KTD-based unsupervised FE.

https://doi.org/10.1371/journal.pone.0289029.t016

One might wonder why we did not compare our performance with that of existing methods. To the best of our knowledge, no other methods are comparable to ours. First, our analysis of associations between gene expression, methylation, and genomic variants is free from location restrictions; this method can detect any kind of association between them, independent of their location along the genome. For example, we can identify interactions between genes and genomic variants that are distant from each other. This is because we can derive the singular value vectors attributed to the subjects, $u_{\ell_1 j}$, at the very beginning of the data analysis flow, just after applying TD to $x_{jj'k}$. Genomic regions and/or variants were then selected based on the singular value vectors attributed to the genomic regions or variants $i_k$. Application of TD to $x_{jj'k}$ requires a very small amount of computational resources, as the number of subjects $M$ is much smaller than the number of features $N_k$ ($M \ll N_k$). To our knowledge, no other method can select genomic regions and variants using such a small amount of computational resources. In particular, treating genomic variants is difficult. For gene expression and methylation, the values can be averaged within individual genomic regions, resulting in a reduced dimension of $i_k$, that is, $N_k$. Nevertheless, this cannot be performed for genomic variants because the integers (0, 1, and 2) derived from the genomic variants are arbitrary. Averaging distinct integers attributed to individual genomic variants can destroy the meaning of these integers. Despite this, our method is independent of the size of $N_k$ and can be applied to genomic variants as is. To the best of our knowledge, no other methods can perform this task; thus, we could not compare the performance of our method with that of any other method.
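The resource argument can be made concrete with a short calculation. The sketch below compares the number of entries in the kernel tensor (M × M × K) with the number of entries in a single raw feature-by-subject matrix (N_k × M). M = 99 and K = 3 follow the text; N_k = 1,000,000 is a hypothetical order of magnitude for a variant matrix, not a figure from the study.

```python
def kernel_tensor_entries(n_subjects, n_omics):
    """Number of entries in the M x M x K kernel tensor."""
    return n_subjects * n_subjects * n_omics

def raw_matrix_entries(n_features, n_subjects):
    """Number of entries in one N_k x M omics matrix."""
    return n_features * n_subjects

M, K = 99, 3                   # subjects per cell type and omics layers
n_features = 1_000_000         # hypothetical N_k for genomic variants

print(kernel_tensor_entries(M, K))          # 29403
print(raw_matrix_entries(n_features, M))    # 99000000
```

The raw matrices still have to be read once to form the kernels and once more to project the selected subject vectors onto the features, but the decomposition itself only ever sees the small tensor, whose size does not grow with N_k.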

Our method does not require prior knowledge of the subjects. Singular value vectors attributed to the subjects, $u_{\ell_1 j}$, can be generated by applying TD to $x_{jj'k}$, which does not require any additional information about the subjects. The selection of the $u_{\ell_1 j}$ used to select $i_k$ was based on the coincidence between those computed for individual autosomes. Therefore, genomic regions and variants can be selected in a fully unsupervised manner. Nevertheless, the selected genes were significantly targeted by multiple TFs that were enriched in KEGG pathway terms.

Several biological insights were obtained from this population-based study. One possible application to clinical studies is to compare the outcomes of the present study with those of other clinical studies. Generally, both population- and clinical-based studies have their own biases, and by comparing their outcomes with each other, we can validate their outcomes, which is impossible when only individual outcomes are present.

However, this method has several limitations. First, it is applicable only to multiomics datasets that share samples. In addition, because this is an unsupervised method, if the downstream analyses yield no significant results, we have no way to improve them.

References

1. Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology. 2011;12(1). pmid:21251332
- View Article
- PubMed/NCBI
- Google Scholar
2. Seal DB, Das V, Goswami S, De RK. Estimating gene expression from DNA methylation and copy number variation: A deep learning regression model for multi-omics integration. Genomics. 2020;112(4):2833–2841. pmid:32234433
- View Article
- PubMed/NCBI
- Google Scholar
3. Wang H, Lou D, Wang Z. Crosstalk of Genetic Variants, Allele-Specific DNA Methylation, and Environmental Factors for Complex Disease Risk. Frontiers in Genetics. 2019;9:695. pmid:30687383
- View Article
- PubMed/NCBI
- Google Scholar
4. Shi X, Radhakrishnan S, Wen J, Chen JY, Chen J, Lam BA, et al. Association of CNVs with methylation variation. npj Genomic Medicine. 2020;5(1). pmid:33062306
- View Article
- PubMed/NCBI
- Google Scholar
5. Roudbar MA, Mohammadabadi MR, Mehrgardi AA, Abdollahi-Arpanahi R, Momen M, Morota G, et al. Integration of single nucleotide variants and whole-genome DNA methylation profiles for classification of rheumatoid arthritis cases from controls. Heredity. 2020;124(5):658–674.
- View Article
- Google Scholar
6. Lea AJ, Vockley CM, Johnston RA, Del Carpio CA, Barreiro LB, Reddy TE, et al. Genome-wide quantification of the effects of DNA methylation on human gene regulation. eLife. 2018;7:e37513. pmid:30575519
- View Article
- PubMed/NCBI
- Google Scholar
7. Natri HM, Bobowik KS, Kusuma P, Crenna Darusallam C, Jacobs GS, Hudjashov G, et al. Genome-wide DNA methylation and gene expression patterns reflect genetic ancestry and environmental differences across the Indonesian archipelago. PLOS Genetics. 2020;16(5):1–21. pmid:32453742
- View Article
- PubMed/NCBI
- Google Scholar
8. Blake LE, Roux J, Hernando-Herraez I, Banovich NE, Perez RG, Hsiao CJ, et al. A comparison of gene expression and DNA methylation patterns across tissues and species. Genome Research. 2020;30(2):250–262. pmid:31953346
- View Article
- PubMed/NCBI
- Google Scholar
9. Alakärppä E, Salo HM, Valledor L, Cañal MJ, Häggman H, Vuosku J. Natural variation of DNA methylation and gene expression may determine local adaptations of Scots pine populations. Journal of Experimental Botany. 2018;69(21):5293–5305. pmid:30113688
- View Article
- PubMed/NCBI
- Google Scholar
10. Franke L, Jansen RC. eQTL Analysis in Humans. In: Methods in Molecular Biology. Humana Press; 2009. p. 311–328. Available from: https://doi.org/10.1007/978-1-60761-247-6_17.
11. Taguchi YH, Turki T. Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis. BMC Medical Genomics. 2022;15(1). pmid:35209912
- View Article
- PubMed/NCBI
- Google Scholar
12. Komaki S, Shiwa Y, Furukawa R, Hachiya T, Ohmomo H, Otomo R, et al. iMETHYL: an integrative database of human DNA methylation, gene expression, and genomic variation. Human Genome Variation. 2018;5(1). pmid:29619235
- View Article
- PubMed/NCBI
- Google Scholar
13. ;. Available from: http://imethyl.iwate-megabank.org/.
14. Analysis pipelines for the GTEx Consortium and TOPMed. Available from: https://github.com/broadinstitute/gtex-pipeline.
15. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012;29(1):15–21. pmid:23104886
16. Hachiya T, Furukawa R, Shiwa Y, Ohmomo H, Ono K, Katsuoka F, et al. Genome-wide identification of inter-individually variable DNA methylation sites improves the efficacy of epigenetic association studies. npj Genomic Medicine. 2017;2(1). pmid:29263827
17. Tadaka S, Katsuoka F, Ueki M, Kojima K, Makino S, et al. 3.5KJPNv2: an allele frequency panel of 3552 Japanese individuals including the X chromosome. Human Genome Variation. 2019;6(1). pmid:31240104
18. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM; 2013.
19. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2018.
20. Taguchi YH. Unsupervised Feature Extraction Applied to Bioinformatics. Springer International Publishing; 2020. Available from: https://doi.org/10.1007/978-3-030-22456-1.
21. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols. 2009;4(8):1184–1191. pmid:19617889
22. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
23. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. 2016;44(W1):W90–W97. pmid:27141961
24. Liu ZP, Wu C, Miao H, Wu H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database. 2015;2015. pmid:26424082
25. Han H, Cho JW, Lee S, Yun A, Kim H, Bae D, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research. 2017;46(D1):D380–D386.
26. Oscanoa J, Sivapalan L, Gadaleta E, Dayem Ullah AZ, Lemoine NR, Chelala C. SNPnexus: a web server for functional annotation of human genome sequence variation (2020 update). Nucleic Acids Research. 2020;48(W1):W185–W192. pmid:32496546
27. Brivanlou AH, Darnell JE. Signal Transduction and the Control of Gene Expression. Science. 2002;295(5556):813–818. pmid:11823631

