
Comparative analyses of chloroplast genomes of Theobroma cacao from northern Peru

  • Daniel Tineo,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Writing – original draft

    dt.infolab@gmail.com

    Affiliation Instituto de Investigación para el Desarrollo Sustentable de Ceja de Selva (INDES–CES), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú

  • Danilo E. Bustamante,

    Roles Conceptualization, Formal analysis, Investigation, Supervision, Validation

    Affiliations Instituto de Investigación para el Desarrollo Sustentable de Ceja de Selva (INDES–CES), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú, Instituto de Investigación de Ingeniería Ambiental, Facultad de Ingeniería Civil y Ambiental (FICIAM), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú

  • Martha S. Calderon,

    Roles Conceptualization, Investigation, Validation, Writing – review & editing

    Affiliations Instituto de Investigación para el Desarrollo Sustentable de Ceja de Selva (INDES–CES), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú, Instituto de Investigación de Ingeniería Ambiental, Facultad de Ingeniería Civil y Ambiental (FICIAM), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú

  • Manuel Oliva

    Roles Conceptualization, Investigation, Project administration, Resources

    Affiliation Instituto de Investigación para el Desarrollo Sustentable de Ceja de Selva (INDES–CES), Universidad Nacional Toribio Rodríguez de Mendoza, Chachapoyas, Amazonas, Perú

Abstract

Theobroma cacao is the most economically important species within the genus Theobroma. Despite its importance, the intraspecific relationships of this species have not been fully elucidated due to insufficient molecular information. To facilitate a better understanding of the intraspecific evolutionary relationships of T. cacao, sequencing technology was used to decode the plastid genomes, with the objective of identifying potential DNA barcode genetic markers, exploring intraspecific relationships, and inferring divergence times. The plastid genomes of the seven cocoa genotypes analyzed in this study exhibited a typical angiosperm genomic structure. However, the structure of each plastid genome reflects notable changes in each genotype; for example, the infA gene was present in all the analyzed samples, unlike in previously published cocoa plastid genomes, while the complete ycf1 gene sequence has potential for use as a DNA barcode in T. cacao. The estimated age of the node connecting T. cacao and T. grandiflorum, which was 10.11 Ma, supports this indication. It can be inferred that T. cacao diverged at approximately 7.55 Ma, and it is highly likely that T. cacao populations diversified during the Pliocene or Miocene. Therefore, it is crucial to perform mitochondrial and nuclear-based analyses on a broader spectrum of cocoa samples to validate these evolutionary mechanisms, including genetic estimates and divergence. This approach would enable a deeper understanding of the evolutionary relationships among cocoa.

Introduction

Theobroma L. is a genus within the Malvaceae family that encompasses 22 species [1]. The most economically important species is Theobroma cacao L. [2]. This tropical understory tree originated in the Amazon basin in South America but is grown as a commercial crop on plantations in Africa, Asia, and America, where it is an important source of income for many farmers [3,4]. According to Utro et al. [5], cocoa is among the ten principal agricultural commodities worldwide. The high market value of T. cacao is attributed to the flavonoids it contains. These secondary metabolites are associated with numerous health benefits, such as reducing the risk of cardiovascular diseases [6]. Apart from being the source of chocolate, cocoa beans offer carbohydrates, fats, proteins, natural minerals, and vitamins [7].

The cocoa industry traditionally distinguishes three main types of cocoa: Forastero, Criollo, and Trinitario. These cocoa varieties are naturally distributed from southern Bolivia to Mexico [8–11]. While there is some historical ambiguity surrounding their nomenclature, the Forastero variety is commonly recognized in the industry for its sturdiness and features dark purple kernels, which have a bitter taste and sometimes a sour flavor [10,12]. The Criollo variety is less resilient than the other varieties and produces kernels that are lightly pigmented or white. These kernels possess a desirable aroma and slight bitterness [13,14]. On the other hand, Trinitario cultivars exhibit high yield and disease resistance, producing kernels that have a much milder taste [10,15,16]. It has been suggested that Trinitario may be a hybrid resulting from a combination of Forastero and Criollo varieties [17–19]. Recent scientific research utilizing microsatellite markers has identified ten distinct genetic clusters of cacao [20]. These groups are distributed across various countries. Amelonado is found in Brazil, Costa Rica, and Ghana; Contamana and Iquitos can be found in Peru and Brazil; Criollo is present in Ecuador, Venezuela, Panama, Costa Rica, and Mexico; National and Curaray are exclusive to Ecuador. Guiana is solely present in Brazil, while Marañón can be found in Peru and Bolivia. Nanay is exclusive to Peru, while Purús is found in Brazil and Bolivia [8]. Moreover, recent studies conducted by Zhang et al. [20], Motamayor et al. [11], and Osorio-Guarín et al. [21] indicated the presence of further cacao populations in Bolivia, Peru, and Colombia, respectively. It is probable that further distinguishable genetic clusters of cocoa will emerge as exploration of the untamed territories of South America increases [22].
However, single nucleotide polymorphism (SNP) analysis is another valuable method to elucidate genetic diversity and identify variations in plant genomes [20]. Despite this potential, high-yield genotyping is economically unfeasible in several developing countries where cacao cultivation occurs [4]. These polymorphic DNA sequences and regions may be valuable for evolutionary and phylogenetic research on the genus Theobroma and family Malvaceae in the future. This approach allows us to elucidate taxonomic ambiguities and pinpoint taxa closely associated with cocoa farming. The designed markers might also prove useful in distinguishing between genetically similar cultivars and wild taxa for breeding initiatives [22]. Recent updates of the T. cacao genome have provided a new and approachable structure for exploring the evolutionary processes, structural and functional genetics, biochemistry, and comparative genomics of the cacao tree [13,14].

The Amazonas region in Peru is the seventh most productive region for Fine Aroma Cocoa, which is renowned for its distinctive aroma and taste and holds great esteem in the global marketplace [23]. In 2015, the Regional Government of Amazonas introduced the denomination of origin for “Cacao Amazonas Peru” via the Regional Ordinance N° 368 [24]. The decision was made based on cocoa’s bromatological qualities, its growing environment, and the Amazonas region’s important role in its genetic diversity [23,24]. In this region, numerous studies have investigated volatile fingerprints [25]; fatty acids [26]; phenolic, aromatic and physicochemical compounds [27,28]; and phenotypic traits [29,30]. However, a detailed molecular characterization of Fine Aroma Cocoa is still lacking [28]. The only genetic study that supports the genetic diversity of Fine Aroma Cocoa in this region is that of Bustamante et al. [31], who used genotyping technology to report the presence of the ten genetic groups described by Motamayor et al. [11]. However, the genome structure and functionality of Amazonas Fine Aroma Cocoa have not been fully elucidated. A structural and functional characterization would allow the expression and function of genes associated with various traits in cocoa to be determined, such as genes involved in biological interactions between the cacao tree and pathogens such as Phytophthora [13,32]. Therefore, the aforementioned methods are crucial for expediting the advancement of potential cultivars via the utilization of pioneering biotechnological methods [33].

In this study, we sequenced and assembled seven complete plastid genomes sourced from Fine Aroma Cocoa (T. cacao) cultivation in northeastern Peru. We scrutinized each genome to identify potential DNA barcode genetic markers, explore intraspecific relationships, and infer divergence times when compared to other available plastid genomes in the GenBank database. These findings may assist in distinguishing distinct Fine Aroma Cocoa varieties.

Materials and methods

Fine aroma cocoa sample collection

Fine Aroma Cocoa samples were collected in Bagua and Utcubamba provinces (665–902 m.a.s.l.) of the Amazonas region in northeastern Peru (Table 1). Servicio Nacional Forestal de Fauna y Flora Silvestre (SERFOR) granted a wild flora scientific research permit for the collection of Fine Aroma Cocoa (MIDAGRI-SERFOR-DGGSPFFS, with authorization code N° AUT-IFL-2020-0051). Samples were obtained from seven Fine Aroma Cocoa genotypes as described by Bustamante et al. [31] in the Bagua and Utcubamba provinces of the Amazonas region (S1 Fig). Approximately 100 mm² of tender cacao leaves were collected for molecular analyses and placed in pre-labeled 2 mL Eppendorf Safelock tubes. The samples were deposited in the KUELAP herbarium of the National University Toribio Rodríguez de Mendoza [34]. The deposit includes comprehensive information on the sampling sites and the characteristics of the plants sampled. The collection code, date, altitude, locality and GPS coordinates were recorded for each collection site. The voucher codes for the samples are KUELAP-611, KUELAP-619, KUELAP-638, KUELAP-646, KUELAP-655, KUELAP-659 and KUELAP-663 (Table 1).

Table 1. Collection codes for samples of fine aroma cocoa (T. cacao).

https://doi.org/10.1371/journal.pone.0316148.t001

DNA extraction, sequencing, assembly and annotation

The Laboratory of Molecular Biology and Genomics of the National University Toribio Rodríguez de Mendoza de Amazonas conducted the DNA extraction. Genomic DNA was obtained using the NucleoSpin Kit (Macherey-Nagel, Düren, Germany). Subsequently, a NanoDrop and a Qubit (Thermo Fisher Scientific, Waltham, MA, USA) were used for optical density measurements of the DNA. Genomic DNA was sequenced commercially by Macrogen (Seoul, South Korea). Briefly, the concentration and purity of the DNA were verified before library preparation through agarose gel electrophoresis and an Agilent TapeStation. The genomic DNA was fragmented and ligated with individual adapters using the Swift 2S Turbo DNA library preparation kit with PCR (Swift Biosciences, Ann Arbor, MI, USA). Next, the size distribution and concentration of the resulting library were evaluated using Qubit and TapeStation. Library sequencing was carried out on the NextSeq 500 platform (Illumina, San Diego, CA, USA) in compliance with established procedures. Paired 150-nucleotide (nt) reads were generated and checked for quality using FastQC (Babraham Institute, Cambridge, UK). The plastid genomes were assembled de novo with MEGAHIT [35], SPAdes v3.13.0 [36], and GetOrganelle v1.7.5.3 [37], and visualized with Bandage v0.8.1 [38]. The reference genome employed during the assembly process was T. cacao (HQ336404; Jansen et al. [39]). The accuracy and circularity of each genome were validated by mapping the reads and contigs against the reference with the mapping tool in Geneious Prime, v. 2020.0.3. The entire chloroplast genome was annotated with MFannot [40], NCBI ORFfinder, and tRNAscan-SE 2.0 [41]. Afterward, the annotations were corrected manually by comparison with the reference genome in Geneious Prime.

Simple sequence repeats and dispersed repeats

SSRs in the T. cacao genome sequences were identified with the MicroSatellite Identification Program (MISA) [42], which is accessible via https://pgrc.ipk-gatersleben.de/misa/. The minimum number of repeats depended on the unit size (in nucleotides) of the SSR, varying from 1_10 (at least 10 repeats of a single nucleotide) for mononucleotide repeats to 6_3 (at least 3 repeats of a six-nucleotide unit) for hexanucleotide repeats, following the configuration for Malvales described by Beier et al. [42]. A minimum separation of 100 base pairs was required for two SSRs to be identified as distinct. To compare the genome structures of T. cacao, we used the online IRscope program (https://irscope.shinyapps.io/irapp/), which allowed the positions of the IR, SSC and LSC regions to be compared across the 19 cp genomes of T. cacao.
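The unit-size thresholds described above can be illustrated with a small search for perfect SSRs. This is a minimal sketch and not MISA itself: only the 1_10 and 6_3 settings come from the text, while the thresholds for unit sizes 2 to 5, the function names `find_ssrs` and `group_compound`, and the reading of the 100 bp rule as "SSRs at most 100 bp apart form one compound SSR" are assumptions made for illustration.

```python
import re

# Minimum repeat count per motif length (MISA-style unit_size -> repeats).
# Only 1 -> 10 and 6 -> 3 are stated in the text; the values for unit
# sizes 2-5 are illustrative assumptions, not the study's settings.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit, minimum in sorted(min_repeats.items()):
        # A motif of `unit` bases followed by at least (minimum - 1) copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit, minimum - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs built from a shorter unit (e.g. "AA", "ATAT") to
            # avoid reporting the same tract under two unit sizes.
            if any(unit % k == 0 and motif == motif[:k] * (unit // k)
                   for k in range(1, unit)):
                continue
            hits.append((match.start(), match.end(), motif,
                         len(match.group(1)) // unit))
    return sorted(hits)


def group_compound(hits, max_gap=100):
    """Group SSRs separated by at most `max_gap` bases (assumed rule)."""
    groups = []
    for hit in sorted(hits):
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


example = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CCG"
ssrs = find_ssrs(example)
compound = group_compound(ssrs)
```

On the toy sequence, the search reports one mononucleotide and one dinucleotide SSR, which fall into a single compound group because they are only 4 bp apart.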

T. cacao polymorphism analysis

All Theobroma plastid genomes were aligned with MAFFT v. 7.0.17 [43]. Geneious Prime v. 2023.0.3 was used to calculate the number of mutations and indel events with the approximate p-value calculation method; indels were counted as events rather than as sites in the alignment, with a minimum coverage of 1, a minimum variant frequency of 0.25, and a minimum strand-bias p-value of 10⁻⁷ [44].
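The distinction between counting indels as events and counting them as sites can be made concrete with a short example. This is a minimal sketch under stated assumptions, not the Geneious Prime variant caller: the function name `count_variants` is hypothetical, it compares only two already-aligned sequences, and it applies none of the coverage, frequency, or strand-bias filters mentioned above.

```python
def count_variants(ref, alt):
    """Count substitutions and indel events between two aligned sequences.

    A run of consecutive gap columns on the same sequence is one indel
    event, regardless of how many alignment columns it spans.
    """
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    indel_events = 0
    gap_side = None  # which sequence carries the gap run currently open
    for r, a in zip(ref.upper(), alt.upper()):
        if r == "-" and a == "-":
            continue  # gap-only column: does not open or close a run
        if r == "-" or a == "-":
            side = "ref" if r == "-" else "alt"
            if side != gap_side:
                indel_events += 1  # a new gap run starts here
            gap_side = side
            continue
        gap_side = None
        if r != a:
            substitutions += 1
    return substitutions, indel_events


# One substitution, one 2-bp insertion and one 3-bp deletion:
result = count_variants("ACGT--ACGTAC", "ACCTGGAC---C")
```

Counted as sites, the two gaps in this toy alignment would contribute five columns; counted as events, they contribute two.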

Phylogenomic analysis and search for specific genes

The seven complete plastid genome sequences generated in this study were combined with 13 Theobroma plastome sequences obtained from GenBank (Table 2). Theobroma grandiflorum (JQ228388, Kane et al. [45]) was used as the outgroup. Sequence alignment was performed with the MAFFT plugin version 7.0.17 [43], while PartitionFinder-2.1.1 [46] was used to select the best-fitting model for the complete plastid genomes. Phylogenetic trees were created using the maximum likelihood and Bayesian inference methods with IQ-TREE v.2.2.0 software [47]. The test model (-m TEST) [48] was used in conjunction with 1,500 ultrafast bootstrap replicates. To construct trees for individual genes and polymorphic intergenic regions, the alignment was split by gene using the ape 5.0 package [49] in RStudio [50]. From this process, bootstrap and UFBoot files were generated with 1,500 ultrafast bootstrap replicates in IQ-TREE v.2.2.0. All the trees produced were combined and analyzed with ASTRAL–III software [51] to establish a consensus tree with 1,500 replicates. The phylogenetic trees were visualized using TreeDyn 198.3 on Phylogeny.fr [52].

Table 2. List of sequences of the chloroplast genome of T. cacao generated in this study and downloaded from NCBI used for data analysis.

https://doi.org/10.1371/journal.pone.0316148.t002

Estimation of T. cacao divergence time

The divergence time of Malvaceae was first estimated using 38 plastome sequences obtained from GenBank (S1 Table). The outgroup consisted of Carica papaya (EU431223), Mangifera indica (KX871231), and Tapiscia sinensis (MF926267). All CDSs found in each species were extracted manually in Geneious Prime, v. 2023.0.3. The CDS dataset was analyzed on the CIPRES Science Gateway portal using an XML input file produced in BEAUti v.1.7.2 [53] within BEAST v1.10.4 [53]. The best-fitting evolutionary model was determined according to the results of PartitionFinder-2.1.1 (GTR + I + G substitution model). BEAST analyses were conducted using a birth-death speciation tree prior [54] and an uncorrelated relaxed clock model [55] with a lognormal distribution. To constrain the age of the crown node of Malvaceae, a fossil-based calibration point was set to 70.7 Ma with a normal prior and a standard deviation of 5, following the work of Wang et al. [56]. Four BEAST runs were executed for 400,000,000 generations each, with parameters sampled every 1,000 generations. The effective sample size (ESS > 200) was determined using Tracer v1.7 [57], with 25% of the samples removed as burn-in and 30% of the trees discarded. We employed TreeAnnotator v1.8.4 [58] to generate the maximum clade credibility (MCC) tree displaying mean divergence time estimates alongside 95% highest posterior density (HPD) intervals. Based on these findings, divergence time estimation was then carried out for the genus Theobroma alone, using the same parameters as those used for Malvaceae. The Theobroma crown node calibration was set to 10.11 Ma with a normal prior and a standard deviation of 4. Six BEAST runs were conducted for 200,000,000 generations each, with parameters sampled every 1,000 generations.
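To give a sense of how wide the two normal calibration priors are, and how many posterior samples one chain contributes, the arithmetic can be worked through with the Python standard library. This is an illustrative sketch, not part of the BEAST workflow: the helper names `normal_prior_interval` and `retained_samples` are hypothetical, and the sampling interval of 1,000 generations is assumed for both analyses.

```python
from statistics import NormalDist


def normal_prior_interval(mean, stdev, mass=0.95):
    """Central interval containing `mass` of a normal calibration prior."""
    dist = NormalDist(mean, stdev)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def retained_samples(generations, sample_every, burnin_fraction):
    """Number of logged states kept from one chain after burn-in."""
    total = generations // sample_every
    return total - int(total * burnin_fraction)


# Malvaceae crown node: normal prior, mean 70.7 Ma, stdev 5
malvaceae_low, malvaceae_high = normal_prior_interval(70.7, 5)

# Theobroma crown node: normal prior, mean 10.11 Ma, stdev 4
theobroma_low, theobroma_high = normal_prior_interval(10.11, 4)

# One Malvaceae chain: 400,000,000 generations, 25% burn-in
kept = retained_samples(400_000_000, 1_000, 0.25)
```

Under these assumptions, the Malvaceae prior places 95% of its mass between roughly 60.9 and 80.5 Ma, the Theobroma prior between roughly 2.3 and 17.9 Ma, and each 400,000,000-generation chain retains 300,000 samples after burn-in.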

Results

Plastomic features of 19 T. cacao sequences

In this study, the chloroplast genomes of seven T. cacao specimens were sequenced. Illumina single-end sequencing yielded 2.23 × 10⁶, 2.14 × 10⁶, 2.98 × 10⁶, 2.26 × 10⁶, 2.21 × 10⁶, 2.42 × 10⁶ and 2.22 × 10⁶ reads of 150 bp for samples INDES06, INDES14, INDES34, INDES50, INDES63, INDES67, and INDES71, with average sequencing depths of 1,570.7; 973.7; 488.34; 1,161.2; 990.4; 203.26 and 990.1, respectively (S2 Fig). On average, 160 Mb of high-quality sequence was obtained from each specimen. Illumina sequencing of plastid DNA produced between 2,142,060 and 2,998,427 clean reads (150 bp) for the seven T. cacao samples analyzed. Seven full plastid genomes were obtained through assembly and annotation. The genomes exhibit the typical angiosperm plastome structure, as shown in Fig 1. The genomes ranged from 160,589 to 160,727 bp in length and had a GC content of 36.9% (Table 3). The gene content comprises 130 genes, including 37 tRNAs, 8 rRNAs, and 85 protein-coding genes (Table 3). The inverted repeat (IR) region contained 17 duplicated genes: six protein-coding, four rRNA, and seven tRNA genes. Furthermore, Table 4 outlines 22 genes associated with photosynthesis, eight genes associated with proton exchange, and 18 genes linked to electron exchange.
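The GC percentage reported for each plastome is the share of G and C among the called bases. A minimal sketch of that calculation follows; the function name `gc_percent` is hypothetical, and ignoring ambiguous bases such as N is an assumption rather than a documented choice of the study.

```python
def gc_percent(seq):
    """GC content in percent, computed over unambiguous A/C/G/T bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / called


toy = gc_percent("ATGCATNN")  # 2 of 6 called bases are G or C
```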

Table 3. Characteristics of complete chloroplast genomes of T. cacao.

https://doi.org/10.1371/journal.pone.0316148.t003

Table 4. Genes encoded in the chloroplast genomes of T. cacao.

https://doi.org/10.1371/journal.pone.0316148.t004

Fig 1. Circular genetic map of the general characteristics of the chloroplast genome of T. cacao.

The map contains six default tracks. From the center outward, the first track shows the scattered repeats connected with arcs. The second track shows the long tandem repeats as short bars. The third track shows short tandem repeats or microsatellite sequences as short bars. The small single-copy (SSC), inverted repeat (IRa and IRb) and large single-copy (LSC) regions are shown in the fourth track. The GC content in the genome is represented in the fifth track. The genes are shown in the sixth track. The optional codon usage bias is shown in parentheses after the gene name. Genes are coded according to their functional classification. The transcription directions of the inner and outer genes are clockwise and counterclockwise, respectively. The functional classification of the genes is shown in the lower left corner.

https://doi.org/10.1371/journal.pone.0316148.g001

Simple sequence repeats and dispersed repeats

The number of simple sequence repeats (SSRs) found in the 19 chloroplast genomes of T. cacao ranged between 73 and 80. A and T were determined to be the most common SSRs, with no G-type mononucleotides present in any of the T. cacao samples (S3A Fig). The most prevalent SSR in terms of the frequency of classified repeat types (in relation to the complementary sequence) was the A/T-type mononucleotide (S3B Fig). The chloroplast genomes exhibited diverse types of SSRs, the most prevalent of which were mononucleotide repeats, which occurred 64–67 times. In contrast, dinucleotide repeats appeared only six times, while trinucleotide repeats appeared once and pentanucleotide repeats one to two times. Notably, no tetranucleotide repeats were detected (S3C Fig). In the specified length intervals (30–39, 40–49, 50–59, 60–69 and ≥ 70), the most abundant SSRs were between 30 and 39 nucleotides in length, followed by those 50–59 in length. The ranges of 40–49, 60–69 and ≥ 70 exhibited the fewest SSRs, as illustrated in S3D Fig. The most prevalent types across all samples were repeats and palindromic repeats, whereas complementary repeats were the least frequent.

Inverted-repeat contraction, expansion, and interspecific comparison

We analyzed the junctions between the inverted repeat (IR) regions and the two single-copy regions in the 19 Theobroma genomes, which included the 7 genomes examined in this study, as well as the adjacent gene locations (Fig 2). The large single-copy (LSC), IR, and small single-copy (SSC) regions had comparable lengths across genomes. The genes located at the junction sites were rpl22, rps19, rpl2, ndhF, ycf1, trnN, trnH, and psbA. Although the rpl22 gene was present in the LSC region, it was detected in only 8 of the genomes. The rps19 gene was identified at the junction of the LSC and IRb regions, while the rpl2 gene was located solely within the IRa and IRb regions and was detected in just 8 genomes. The ndhF gene spanned the SSC and IRb regions at 3 to 6 bp intervals, except in the INDES14 (KUELAP-619) genome, where it was located exclusively in the IRb region. Similarly, the ycf1 gene is typically found within the SSC region, but in the INDES67 (KUELAP-659) genome it was identified in both the SSC and IRa regions, with only a 4 bp gap between them (Fig 2). The trnN gene was located entirely within the IRa region in all 19 genomes. The trnH gene was located in the LSC region and crossed the boundary by 2 bp into the IRa region. Moreover, the psbA gene was detected in the LSC region in all 19 genomes (Fig 2). Additionally, 12 cis-spliced genes and one trans-spliced gene were identified in all genomes.

Fig 2. A comparison of large single-copy (LSC), inverted repeat (IR), and small single-copy (SSC) junction positions was conducted across 19 T. cacao plastomes.

The distance to the boundary or the length of genes in single-copy regions and IR regions is indicated next to each gene.

https://doi.org/10.1371/journal.pone.0316148.g002

Polymorphism analysis of T. cacao chloroplast genomes

A total of 80 polymorphic sites were identified in the chloroplast whole-genome sequences of 19 T. cacao genomes, among which were 56 indels and 58 singleton variants. The genes matK, atpF, rpoC2, psbC, psaA, cemA, rpl32, ccsA, ndhD, psaC and ycf1 exhibited 14 variation sites in total (Fig 3). The genes ycf1 (3 variants), rpoC2 and psbC (each with 2 variants) had the highest number of variation sites. Moreover, these genes also had the greatest number of indels (2 each). The intergenic regions that exhibited the most variation were rpl32-trnL, matK-rps16, ndhF-rpl32, atpH-atpF, and rps15-ycf1 (Fig 3; S4 Fig). These variations, along with the use of T. grandiflorum as an outgroup, resulted in a tree with high support values on each branch (Fig 3). The alignment of all the plastid genomes indicated a high degree of similarity (greater than 50% identity) in the total sequence, with intraspecific divergence between 0.0006 and 0.04%. The interspecific divergence between T. grandiflorum and Fine Aroma Cocoa amounted to 0.28%. Furthermore, we identified three Fine Aroma Cocoa groups that exhibit comparable genomic characteristics. The first of these groups (BS/BPP = 99/1) contained samples INDES63 (OP354235), INDES14 (OP354233), INDES06 (OP354232), and INDES50 (OP354234). Collectively, these samples formed a sister clade to other T. cacao genomes (JQ228389, HQ336404, JQ228380, and JQ228381). A second clade (BS/BPP = 100/1) consisted of sample INDES34 (MZ725364). This sample was closely related to the JQ228382, HQ244500, JQ228383, and KY085907 genomes. The third clade (BS/BPP = 100/1) included INDES67 (MZ725365) and INDES71 (OP354236). This clade was closely related to the cocoa genomes JQ228385, JQ228386, JQ228387, and JQ228379 (Fig 3).

Fig 3. Phylogenomic tree of T. cacao generated by maximum likelihood inference.

The nodes indicate bootstrap support (BS) and posterior probability (BPP) and are presented above the branches. The scale represents the number of nucleotide substitutions per site. The horizontal bars show whole genomes, and the colored vertical bars on each genome indicate SNPs.

https://doi.org/10.1371/journal.pone.0316148.g003

Phylogenetic analysis

The tree topology based on the whole T. cacao plastid genome was identical to that of three genes (matK, infA, and ycf1) as well as three intergenic regions (rpl32-trnL, matK-rps16, and ndhF-rpl32) (Fig 4A and B). Taken individually, the rpl32-trnL spacer (S4A Fig) and the ycf1 gene located in the SSC region (S4D Fig) were the regions that best reproduced this topology. This was assessed against the complete sequence of the plastid genomes, considering both exons and introns. On the other hand, the matK-rps16 (S4B Fig) and ndhF-rpl32 (S4C Fig) spacers, as well as the matK and infA genes, exhibited low topology similarity (S4E and F Fig). The intraspecific differentiation rate of the matK + infA + ycf1 combination was 1%, which included coding regions of 9,438 bp and noncoding regions of 4,805 bp, demonstrating noteworthy similarity, with an identity greater than 50%. The ycf1 and matK genes exhibited divergence rates of 0.01–0.9% and 0.03–0.1%, respectively. Nevertheless, the infA coding region presented a high degree of intraspecific divergence (38%) due to a deletion of 80 bp in some genomes, which prevents this infA gene from being recognized (S5 Fig). The combination of noncoding regions showed intraspecific divergence ranging from 0.04–0.2%. Additionally, the spacer sequences of rpl32-trnL, matK-rps16, and ndhF-rpl32 exhibited divergence rates of 0.08–0.2%, 0.07–0.1%, and 0.09–0.5%, respectively.
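Divergence percentages of this kind are typically derived from pairwise comparisons of aligned sequences. The sketch below computes an uncorrected p-distance in percent; it is an illustration only, since the exact procedure used in Geneious Prime is not described here. The function name `percent_divergence` is hypothetical, and skipping columns that contain a gap or an ambiguous base is an assumption. Under that assumption a large deletion, such as the 80 bp loss in infA, would not by itself raise the value, so the 38% figure above presumably counts gapped positions as differences.

```python
def percent_divergence(seq_a, seq_b):
    """Uncorrected pairwise divergence (p-distance) in percent.

    Only alignment columns where both sequences carry A, C, G or T are
    compared; gapped and ambiguous columns are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * differing / compared


ungapped = percent_divergence("ACGTACGTAC", "ACGTACGTAT")  # 1 of 10 differ
gapped = percent_divergence("ACGT-CGTAC", "ACGTACGTAT")    # 1 of 9 compared
```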

Fig 4. Maximum-likelihood phylogenetic inference of 19 T. cacao plastomes based on the matK + infA + ycf1 genes (A) and the rpl32–trnL + matK–rps16 + ndhF–rpl32 spacers (B).

The numbers associated with the nodes are bootstrap support (BS) values. The scale indicates the number of nucleotide substitutions per site.

https://doi.org/10.1371/journal.pone.0316148.g004

Estimation of Theobroma divergence time

The age of the Malvaceae crown node was estimated to be 70.7 Ma, while that of the Theobroma stem was 52.4 million years (S6 Fig). The node age of T. cacao and T. grandiflorum was estimated to be 10.11 million years ago (Fig 5). These estimations suggest that the 19 T. cacao accessions had a common ancestor approximately 7.55 million years ago (95% HPD), diverging into three clades approximately 3.83 million years ago (95% HPD) (A), 3.61 million years ago (95% HPD) (B), and 3.56 million years ago (95% HPD) (C). From this time onwards, many lineages began to evolve independently during the Pleistocene epoch, approximately 0.31 to 1.82 million years ago (Fig 5). Samples INDES67 (MZ725365) and INDES71 (OP354236) shared a common ancestor estimated to have lived approximately 850,000 years ago. Samples INDES06 (OP354232) and INDES50 (OP354234) similarly shared a common ancestor approximately 310,000 years ago. In addition, it is estimated that samples INDES63 (OP354235) and INDES14 (OP354233) shared a common ancestor approximately 750,000 years ago, while INDES34 (MZ725364) appeared approximately 1.3 million years ago (Fig 5).

Fig 5. Theobroma chronogram based on protein coding sequences estimated from BEAST.

The values at the nodes indicate divergence dates in millions of years.

https://doi.org/10.1371/journal.pone.0316148.g005

Discussion

Chloroplast genome structure

In this study, the plastid genomes of Fine Aroma Cocoa were decoded. This is the first study in the Amazonas region in which massive sequencing technologies have been used to locate and assign functions to plastid genes in this important crop. The structure, content, organization, and characteristics of the plastid genomes of seven Fine Aroma Cocoa samples (INDES06, INDES14, INDES34, INDES50, INDES63, INDES67, and INDES71) demonstrated significant similarity to other Theobroma plastid genomes, including those of T. grandiflorum [20], as well as to plastid genomes of Gossypium [59], Tilia [60,61], and Hibiscus [62]. However, there were apparent variations in the sizes of the LSC, SSC, and IR regions (Fig 2, Table 3). These differences indicate that the IR regions are more stable within T. cacao, a phenomenon that prevails throughout other Malvales species [56]. Although there were considerable increases in the IR and LSC boundaries within Gossypium [59], Tilia [59,61], and Hibiscus [62], these expansions were modest. These modifications to the IR regions could be linked to the formation of pseudogenes, similar to what occurs in Malpighiales [63].

Simple sequence repeats and dispersed repeat contents

Our findings showed that Theobroma sequences exhibit comparable GC contents [13,22]. All the genomes shared similar properties, including the total number of genes (130), duplicated genes (17), and protein-coding genes (85), with the exception of the infA gene, which is functional in T. cacao and other genera of Malvaceae, such as Tilia [60,61]. However, in T. grandiflorum, its function has yet to be determined [22], and in other nearby genera, such as Gossypium [59], Hibiscus [22,62], and Sida [64], infA is a pseudogene. The infA gene was identified in all seven genomes examined in this study (S5 Fig). The infA gene functions to regulate the selection of mRNA, creating the preinitiation complex. Nevertheless, the gene is absent in other published cocoa genomes because of an 80-base pair loss, which renders it unrecognizable (S5 Fig). This finding suggests that the gene may have undergone an evolutionary event, been transferred, or been functionally replaced by another gene in the nucleus. This hypothesis is supported by nuclear transcriptome analysis, which revealed genetic transfers, such as infA and rpl32, from the chloroplast to the nucleus in Hypericum ascyron [65]. Moreover, the research conducted by Park et al. [66] and Millen et al. [67] on Thalictrum coreanum and other angiosperms, respectively, demonstrated evolutionary variations that suggest such gene transfers are possible. A majority of genes contain an AUG initiation codon, except for those starting with GTG [68]. However, according to the results of the present study, the infA gene harbors a UUG initiation codon, which is effective at precisely initiating the translation of infA mRNA [65,67], despite its inefficiency as an initiation codon [68]. Further studies on the nuclear transcriptome of T. cacao are therefore necessary to elucidate the evolutionary changes in the infA gene. It is also crucial to investigate other unknown functions within the plastid genome [22].
Additionally, it is essential to locate other genes that are typically present more frequently in subtelomeric regions in cacao [13,14]. It has been determined that codons with the same terminus (A/T) are responsible for encoding most amino acids. This trend is also observed in genomes with high AT percentages, which are typical of Malvaceae species [13,22,56]. One potential reason for the increased frequency of A/T repeats is polyadenylation at the mRNA end in the cp genes of various species [61]. In addition, during plastome replication, strand separation is easier for A/T pairs than for G/C pairs [69]. Therefore, the simple sequence repeats (SSRs) identified in this study will be beneficial for future population genetics and evolutionary investigations of Fine Aroma Cocoa. Furthermore, these SSRs are important sources of molecular markers for biogeographic research [70,71].

Polymorphism analysis of T. cacao chloroplast genomes

The variation in size observed in each plastid genome of Fine Aroma Cocoa is attributed to the accumulation of indels, which results in genetic variation [22,72]. Additionally, this genetic variation could have arisen from virus-derived eukaryotic genes, which prompt the exchange of genetic material and early diversification in plastid and mitochondrial genomes through glycosyltransferases [73]. Viruses use genes from their hosts to replicate and spread throughout plants [74,75], thereby overcoming the host immune system, and such genes become crucial assets for adaptation [76,77] and evolution [78,79]. While some of these genes are conserved, others serve novel functions in plants [80,81]. However, this hypothesis has not yet been tested in cacao plantations in Amazonas. Bustamante et al. [31] reported that the majority of cacao samples from the Amazonas region analyzed with SNPs were heterozygous. The coexistence of multiple alleles within plant cells may lead to high levels of heterogeneity among plastid genome copies and to the occurrence of transposable elements (TEs) [82–85], as occurs in species of Gossypium [59], Tilia [60,61], and Hibiscus [62]. This suggests that the Fine Aroma Cocoa plantations exhibit genetic variability due to successive backcrossing of trees, along with various insertions or deletions within the plastid genomes caused by environmental factors, resulting in polymorphisms that can persist across generations. The genetic material of plantations with low homozygosity is relatively conserved [31]. Sequence differences were detected in the exonic and intronic regions of each genome, suggesting the potential use of intraspecific SNPs in identifying differential allele expression, including in studies of interspecific hybrid expression [86]. Given that every haplotype has a unique cpSNP profile, genetic clusters can be distinguished within the cacao population [19].
For instance, SNPs in the trnH-psbA chloroplast region are used as markers to identify cacao haplotypes [87]. However, our study revealed that the ycf1 gene has a greater number of SNPs than other genes and could be a more efficient marker for assessing intraspecific genetic variability in Fine Aroma Cocoa, because ycf1 has both SNPs and indels in its coding sequence. Other genes, such as infA and rpoC2, were also described in the present study, but they were insufficient for distinguishing genetic groups within Fine Aroma Cocoa. Alternatively, paternal transmission (paternal leakage) of chloroplasts through pollen [83] may also contribute to the variation in simple sequence repeats or the presence of indels in some varieties of Fine Aroma Cocoa. Nevertheless, this hypothesis has yet to be rigorously examined in cocoa plants, underscoring the need for further research to elucidate the phenomenon of self-compatibility. Future studies will concentrate on identifying viral sequences integrated into the cacao genome to determine possible pathways associated with probable horizontal gene transfer (HGT) and to investigate the existence of heteroplasmy and/or polyplasmy, given the genetic heterogeneity and the few homozygous trees found in Fine Aroma Cocoa plantations across the Amazonas region [31].
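The comparison of SNP and indel counts across genes can be illustrated with a short sketch. It assumes pairwise-aligned sequences in which gaps are written as "-", counts each run of gap columns as a single indel event, and is not the polymorphism workflow used in this study. The function name `count_variants`, the gene labels, and the toy sequences are all hypothetical.

```python
def count_variants(ref: str, alt: str) -> dict:
    """Count SNP columns and indel events between two aligned sequences.

    A run of consecutive gap columns is counted as one indel event.
    """
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    snps = 0
    indels = 0
    in_gap = False
    for a, b in zip(ref.upper(), alt.upper()):
        if a == "-" and b == "-":
            continue  # a gap shared by both sequences carries no information
        if a == "-" or b == "-":
            if not in_gap:
                indels += 1  # first column of a new gap run
            in_gap = True
        else:
            in_gap = False
            if a != b:
                snps += 1
    return {"snps": snps, "indels": indels}


# Hypothetical aligned pairs keyed by placeholder gene labels.
toy_alignments = {
    "gene_A": ("ATGCCGTA--GGT", "ATGACGTATTGG-"),
    "gene_B": ("ATGAAA", "ATGAAA"),
}

summary = {gene: count_variants(r, a) for gene, (r, a) in toy_alignments.items()}
ranked = sorted(summary, key=lambda g: sum(summary[g].values()), reverse=True)
print(summary)
print("Most variable first:", ranked)
```

Ranking genes by the combined count is one simple way to flag candidate markers; a real comparison would also normalize by alignment length.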

Phylogenomic analysis and search for specific genes

Phylogenomic analysis showed that the plastid genomes INDES71 (OP354236) and INDES67 (MZ725365) (Fig 5; Clade A) are closely related to the genetic group previously identified as Trinitario and pure Criollo cacao [45], whose samples were collected in Suriname, Trinidad and Tobago, and the United States [11,88]. However, the study by Bustamante et al. [31] produced different results, revealing that Amazonas cocoa shares genetic material with almost all the varieties previously described by Motamayor et al. [11], with one variety usually predominating over the others. For example, INDES67 exhibited its highest genetic loads for the National (33%) and Contamana (26.21%) varieties, whereas INDES71 combined 46.05% of the Iquitos variety and 39.25% of the National variety [31]. The INDES34 sample, in turn, was grouped in clade B with Trinitario and Forastero varieties from the Lower Amazonas. Finally, four of the seven plastid genomes examined in this study fall within clade C and resemble Forastero samples from the Lower Amazonas. The primary genetic components of INDES06 (OP354232) were National (68%) and Iquitos (12.46%), whereas National (81.68% and 75.69%) and Curaray (17.49% and 19.65%) were more prevalent in INDES14 (OP354233) and INDES63 (OP354235), respectively. However, the clarity of the data is limited. In INDES50 (OP354234), the National (36.02%) and Amelonado (13.8%) components are prevalent [31]. To determine the predominant genetic group in these samples, the original material must be evaluated. The distinction between genetic groups indicates significant interbreeding and gene flow in cocoa [45]. Hence, it is crucial to conduct research at the genomic level to comprehend the diversity of Fine Aroma Cocoa.
Even if two varieties share a recent common ancestor, distinct differences can still be identified and demonstrated [45].

Furthermore, to clarify the complex relationships within T. cacao, inter- or intraspecific analysis requires specific or universal markers. The chosen barcode should therefore be variable enough to discriminate taxa yet conserved enough to allow successful primer design, PCR amplification, and sequencing [89]. The initial plant barcodes selected were rbcL and matK [90], with rbcL being deemed optimal for lower plants [89] and matK for angiosperms [91]. Other plastid regions commonly used in plant molecular systematics include atpF-H, psbK-I, rpoC1, rpoB, trnH-psbA, and trnL-F [92–95]. However, this study found these regions to be ineffective at distinguishing genetic groups within T. cacao. They may be more advantageous for genus-level classifications within the Malvaceae. For instance, trnH-psbA exhibited poor efficiency as a universal marker due to its variability across all plastid genomes [96]. Its variability exceeds that of matK and rbcL [97], but its use as a universal barcode is limited by inversions and insertions [89]. This phenomenon was also observed in the other regions examined in this study, namely, rpl32-trnL, matK-rps16, ndhF-rpl32, trnS-G, accD-psaI, atpF-H, psbK-I, rpoC1, rpoB, and trnL-F. Individually, these regions were not useful for distinguishing individuals within T. cacao, whereas combinations of several intergenic regions, such as rpl32-trnL, matK-rps16, and ndhF-rpl32, proved more effective (Fig 4). Thus, the potential of using these markers in combination to distinguish among groups of T. cacao cannot be ignored, as combining these regions has proved more practical for distinguishing individuals of separate species [92–95].

In contrast, previous studies have suggested that ycf1 and ndhF provide valuable data for DNA barcoding owing to high levels of variation in flowering plants [90,98]. This study established that the entire ycf1 gene sequence differentiates individuals within T. cacao more effectively than the combination of the matK + infA + ycf1 genes (Fig 4A) or of the rpl32-trnL + matK-rps16 + ndhF-rpl32 spacers (Fig 4B). These findings suggest that the ycf1 gene may function as a universal marker for demarcating species within Malvaceae and other plant groups, similar to its use in several phylogenetic applications for Pinaceae [99], Orchidaceae [100], Lamiaceae [101], and Prunus [98]. It has also been successful in studies of several angiosperms, gymnosperms, monilophytes, and bryophytes [102]. The ycf1 gene is functional and essential for plant viability because it acts as a protein precursor and is not usually lost [103]. However, it might not be useful in all taxa [104], given the absence of the ycf1 gene in Poaceae species [89].
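One simple way to compare the discriminating power of single markers and marker combinations is to count how many individuals carry a sequence found in no other individual. The sketch below illustrates that idea only. This study compared markers with maximum-likelihood trees, not with this statistic, and the function names, individual labels, and toy alignments are hypothetical.

```python
from collections import Counter


def haplotype_count(aligned: dict) -> int:
    """Number of distinct sequences (haplotypes) among the individuals."""
    return len(set(aligned.values()))


def resolved_fraction(aligned: dict) -> float:
    """Fraction of individuals whose sequence is shared with no other individual."""
    counts = Counter(aligned.values())
    unique = sum(1 for seq in aligned.values() if counts[seq] == 1)
    return unique / len(aligned)


def concatenate(*markers: dict) -> dict:
    """Join per-individual sequences across markers that share the same labels."""
    labels = markers[0].keys()
    if any(m.keys() != labels for m in markers):
        raise ValueError("all markers must cover the same individuals")
    return {label: "".join(m[label] for m in markers) for label in labels}


# Hypothetical aligned marker sequences for four placeholder individuals.
marker_a = {"ind1": "AAGT", "ind2": "AAGT", "ind3": "AAGT", "ind4": "ACGT"}
marker_b = {"ind1": "GGA", "ind2": "GGC", "ind3": "GGC", "ind4": "GGA"}
combined = concatenate(marker_a, marker_b)

for name, data in [("A", marker_a), ("B", marker_b), ("A+B", combined)]:
    print(name, haplotype_count(data), resolved_fraction(data))
```

In this toy case each marker alone yields two haplotypes, while the concatenation yields three and resolves half of the individuals, which mirrors the reasoning for testing combined regions.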

Estimation of T. cacao divergence time

Recent research has indicated that Fine Aroma Cocoa plants are genetically descended from the National variety, with some genetic contributions from Criollo and other varieties [31]. However, the genetic differentiation of these cacao plants has not been determined. Using seventy-eight coding sequences from complete plastid genomes, this study estimated the divergence time of Malvaceae to be approximately 70.7 Ma, which agrees with the results of Wang et al. [56]. Although the methods of analysis and taxon sampling have certain limitations, as noted by Wang et al. [56], the results allowed us to estimate the age of diversification of T. cacao (7.55 Ma), suggesting that this economically important species has had ample time to generate significant within-species genetic diversity [105]. It can be deduced that between 3.5 and 3.9 Ma, three ancient and distinct lineages emerged and dispersed into cacao populations (clades A, B and C; Fig 5). These populations gave rise to the majority of the samples analyzed in this study, which originated recently, in the Pleistocene epoch (0.31–2.3 Ma), with the exception of samples JQ228379 and JQ228381, which date back to the Pliocene epoch (approximately 3.61–3.83 Ma). It is therefore likely that cocoa populations diverged during the Pliocene or Miocene epochs. Other contemporary populations may have adapted during the Holocene, the most recent epoch of the Quaternary period. The separation of the three now extinct lineages (clades A, B, and C, as depicted in Fig 5) does not necessarily imply that the current individuals are pure. This study confirmed that the samples analyzed carried a certain degree of genetic material from the various genetic groups previously described by Motamayor et al. [11]. This finding supports the hypothesis proposed by Bustamante et al. [31] that a significant portion of Fine Aroma Cocoa plants are heterozygous, with a relatively small number of homozygous individuals. Evidence of this evolutionary process includes the partial loss of the infA gene (S5 Fig) in published cocoa samples, whereas our analyzed samples contained the complete gene. In addition, variability in insertions and deletions was detected within the plastid genome, similar to the variability in evolutionary rate between genes and lineages observed in cotton chloroplasts [85]. Therefore, the possibility that these evolutionary phenomena occurred in Fine Aroma Cocoa cannot be excluded, given that the genus Theobroma diversified at an accelerated rate within Malvaceae during the mid-Miocene Andean uplift [105].
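The assignment of node ages to geological epochs can be checked with a small lookup. The sketch below is illustrative only: the epoch boundary ages are an assumption taken from the International Chronostratigraphic Chart as commonly cited (they are not given in this article), the function name `epoch_for_age` is hypothetical, and ages older than the Miocene are deliberately left unclassified.

```python
# Older (basal) boundary of each epoch in millions of years ago (Ma).
# Assumed values from the International Chronostratigraphic Chart,
# not from this article.
EPOCH_BASES_MA = [
    ("Holocene", 0.0117),
    ("Pleistocene", 2.58),
    ("Pliocene", 5.333),
    ("Miocene", 23.03),
]


def epoch_for_age(age_ma: float) -> str:
    """Return the epoch containing an age given in Ma (point estimates only)."""
    if age_ma < 0:
        raise ValueError("age must be non-negative")
    for name, base in EPOCH_BASES_MA:
        if age_ma <= base:
            return name
    return "older than Miocene"


# Point estimates quoted in the text.
for age in (10.11, 7.55, 3.9, 3.5, 2.3, 0.31):
    print(f"{age:>6} Ma -> {epoch_for_age(age)}")
```

Because the lookup uses point estimates, a node whose credibility interval spans a boundary could fall in either epoch, which is consistent with the cautious wording "Pliocene or Miocene".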

Conclusions

In this study, the plastid genomes of Fine Aroma Cocoa were decoded. This is an innovative application of large-scale sequencing technologies in Peru, which aids in the identification and analysis of plastid genes of this important crop in the Amazonas region. Complete sequencing of the plastid genomes provides a more precise understanding of the intraspecific relationships of Fine Aroma Cocoa. The plastid genomes retain the common genomic structure of angiosperms, containing 130 genes (85 protein-coding genes, 37 tRNAs, and 8 rRNAs) with a 36.9% GC content. The structure of the plastid genome also showed notable evolutionary change, as the infA gene was present in all the samples analyzed, in contrast with published cacao plastid genomes. Furthermore, the complete sequence of the ycf1 gene was found to hold more promise for studying intraspecific relationships in T. cacao. Finally, the estimated age of the node connecting T. cacao and T. grandiflorum is 10.11 million years ago (Ma). These estimates suggest that T. cacao diverged approximately 7.55 Ma, and it is highly probable that cacao populations diversified during the Pliocene or Miocene epochs. Mitochondrial and nuclear studies on a greater number of cocoa samples are needed to validate these evolutionary processes, including genetic estimates and divergence.

Supporting information

S1 Fig. Map of the collection sites of the seven T. cacao trees from the Amazonas Region, northern Peru.

This map was created by the authors using open access resources. The national, provincial, and district boundaries were obtained in shapefile format (datum WGS 1984) from the Geoportal of the National Geographic Institute of Peru (IGN) at the following link: https://www.idep.gob.pe/geovisor/VisorDeMapas-3D/, which is located within the spatial information MED: http://sigmed.minedu.gob.pe/descargas/ (accessed on 6 August 2023). The map is for illustrative purposes only.

https://doi.org/10.1371/journal.pone.0316148.s001

(TIF)

S1 Table. List of species used for divergence analysis of Malvaceae, including the new complete chloroplast genomes of T. cacao.

https://doi.org/10.1371/journal.pone.0316148.s002

(DOCX)

S2 Fig. Sequencing depth maps of the T. cacao chloroplast genomes. a = INDES06, b = INDES14, c = INDES34, d = INDES50, e = INDES63, f = INDES67 and g = INDES71.

The depth of each base was calculated with samtools depth.

https://doi.org/10.1371/journal.pone.0316148.s003

(TIF)

S3 Fig. Distribution of SSRs and dispersed repeats in the chloroplast genomes of T. cacao.

(A) Frequency of identified SSR motifs; (B) Frequency of classified repeat types (considering sequence complementarity); (C) Numbers of different SSR types detected in the cp genomes; (D) Numbers of dispersed repeat types within a given length interval (30 to 39, 40 to 49, 50 to 59, 60 to 69 and ≥ 70).

https://doi.org/10.1371/journal.pone.0316148.s004

(TIF)

S4 Fig. Maximum-likelihood phylogenetic inference of 21 T. cacao individuals based on the rpl32_trnL spacer (A), matK_rps16 spacer (B), ndhF_rpl32 spacer (C), ycf1 gene (D), matK gene (E) and infA gene (F).

The numbers associated with the nodes are bootstrap support (BS) values. The scale indicates the number of nucleotide substitutions per site.

https://doi.org/10.1371/journal.pone.0316148.s005

(TIF)

S5 Fig. Comparison of the infA gene in the different chloroplast genomes of T. cacao, including Theobroma grandiflorum.

https://doi.org/10.1371/journal.pone.0316148.s006

(TIF)

S6 Fig. Chronogram of Malvales based on 78 CDS sequences, estimated with BEAST.

The red and blue stars represent two fossil constraints, and the green star represents one secondary calibration obtained from the literature.

https://doi.org/10.1371/journal.pone.0316148.s007

(TIF)

Acknowledgments

We are grateful to Marco A. Pasapera Alvitres for collecting the samples.

References

  1. 1. Bayer C, Fay MF, Bruijn AY, Savolainen V, Morton CM, Kubitzki K, et al. Support for an expanded family concept of Malvaceae within a recircumscribed order Malvales: a combined analysis of plastid atpB and rbcL DNA sequences. Bot J Linn Soc. 1999;129(4):267–303.
  2. 2. Gopaulchan D, Motilal LA, Bekele FL, Clause S, Ariko JO, Ejang HP, et al. Morphological and genetic diversity of cacao (Theobroma cacao L.) in Uganda. Physiol Mol Biol Plants. 2019;25(2):361–75. pmid:30956420
  3. 3. Bartley BG. The genetic diversity of cacao and its utilization. Wallingford; CABI: 2005.
  4. 4. Da Silva MR, Clément D, Gramacho KP, Monteiro WR, Argout X, Lanaud C, et al. Genome-wide association mapping of sexual incompatibility genes in cacao (Theobroma cacao L.). Tree Genet Genomes. 2016;12(3):1–13.
  5. 5. Utro F, Cornejo OE, Livingstone D, Motamayor JC, Parida L. ARG-based genome-wide analysis of cacao cultivars. BMC Bioinf. 2012;13(S19):1–11.
  6. 6. Hooper L, Kay C, Abdelhamid A, Kroon PA, Cohn JS, Rimm EB, et al. Effects of chocolate, cocoa, and flavan-3-ols on cardiovascular health: a systematic review and meta-analysis of randomized trials. Am J Clin Nutr. 2012;95(3):740–51. pmid:22301923
  7. 7. Boza EJ, Motamayor JC, Amores FM, Cedeño-Amador S, Tondo CL, Livingstone DS, et al. Genetic characterization of the cacao cultivar CCN 51: its impact and significance on global cacao improvement and production. J Am Soc Hortic Sci. 2014;139(2):219–29.
  8. 8. Motamayor JC, Risterucci AM, Lopez PA, Ortiz CF, Moreno A, Lanaud C. Cacao domestication I: the origin of the cacao cultivated by the Mayas. Heredity (Edinb). 2002;89(5):380–6. pmid:12399997
  9. 9. Wickramasuriya AM, Dunwell JM. Cacao biotechnology: current status and future prospects. Plant Biotechnol J. 2018;16(1):4–17. pmid:28985014
  10. 10. Cheesman EE. Notes on the nomenclature, classification and possible relationships of cocoa populations. Trop Agric. 1944;2:144–59.
  11. 11. Motamayor JC, Lachenaud P, da Silva e Mota JW, Loor R, Kuhn DN, Brown JS, et al. Geographic and genetic population differentiation of the Amazonian chocolate tree (Theobroma cacao L). PLoS One. 2008;3(10):e3311. pmid:18827930
  12. 12. Loor RG, Risterucci AM, Courtois B, Fouet O, Jeanneau M, Rosenquist E, et al. Tracing the native ancestors of the modern Theobroma cacao L. population in Ecuador. Tree Genet Genomes. 2009;5(3):421–33.
  13. 13. Argout X, Salse J, Aury J-M, Guiltinan MJ, Droc G, Gouzy J, et al. The genome of Theobroma cacao. Nat Genet. 2011;43(2):101–8. pmid:21186351
  14. 14. Argout X, Martin G, Droc G, Fouet O, Labadie K, Rivals E, et al. The cacao Criollo genome v2. 0: an improved version of the genome for genetic and functional genomic studies. BMC Genom. 2017;18(1):1–9.
  15. 15. Bekele FL, Bidaisee GG, Bhola J. A comparative morphological study of two Trinitario groups from the International Cocoa Genbank, Trinidad. Annu. Cocoa Research Unit, University of the West Indies: 2007. pp 34–42.
  16. 16. Motilal LA, Zhang D, Umaharan P, Mischke S, Mooleedhar V, Meinhardt LW. The relic Criollo cacao in Belize – genetic diversity and relationship with Trinitario and other cacao clones held in the International Cocoa Genebank, Trinidad. Plant Genet Resour. 2010;8(2):106–15.
  17. 17. Motamayor JC, Risterucci AM, Heath M, Lanaud C. Cacao domestication II: progenitor germplasm of the Trinitario cacao cultivar. Heredity (Edinb). 2003;91(3):322–30. pmid:12939635
  18. 18. Motilal LA, Sreenivasan TN. Revisiting 1727: crop failure leads to the birth of Trinitario cacao. J Crop Improv. 2012;26(5):599–626.
  19. 19. Yang JY, Scascitelli M, Motilal LA, Sveinsson S, Engels JMM, Kane NC, et al. Complex origin of Trinitario-type Theobroma cacao (Malvaceae) from Trinidad and Tobago revealed using plastid genomics. Tree Genet Genomes. 2013;9(3):829–40.
  20. 20. Zhang D, Martínez WJ, Johnson ES, Somarriba E, Phillips-Mora W, Astorga C, et al. Genetic diversity and spatial structure in a new distinct Theobroma cacao L. population in Bolivia. Genet Resour Crop Evol. 2012;59(2):239–52.
  21. 21. Osorio-Guarín JA, Berdugo-Cely J, Coronado RA, Zapata YP, Quintero C, Gallego-Sánchez G, et al. Colombia a source of cacao genetic diversity as revealed by the population structure analysis of germplasm bank of Theobroma cacao L. Front Plant Sci. 2017;8:290189.
  22. 22. Abdullah, Waseem S, Mirza B, Ahmed I, Waheed MT. Comparative analyses of chloroplast genomes of Theobroma cacao and Theobroma grandiflorum. Biologia. 2020;75(5):761–71.
  23. 23. MINAGRI. Estudio del cacao del Perú y el mundo: Un análisis de la producción y el comercio. 2018. Available from: https://www.minagri.gob.pe/portal/monitoreo-agroclimatico/cacao-2018
  24. 24. El Peruano. Declaran de interés regional la obtención de la denominación de origen del “Cacao Amazonas Perú”; Ordenanza Regional Nº 368, Gobierno Regional Amazonas/CR. 2015. Available from: https://busquedas.elperuano.pe/normaslegales/declaran-de-interes-regional-la-obtencion-de-la-denominacion-ordenanza-no-368-gobierno-regional-amazonascr-1270354-1/
  25. 25. Valle-Epquín MG, Balcázar-Zumaeta CR, Auquiñivín-Silva EA, Fernández-Jeri AB, Idrogo-Vásquez G, Castro-Alayo EM. The roasting process and place of cultivation influence the volatile fingerprint of Criollo cacao from Amazonas, Peru. Sci Agropecu. 2020;11(4):599–610.
  26. 26. Oliva-Cruz M, Mori-Culqui PL, Caetano AC, Goñas M, Vilca-Valqui NC, Chavez SG. Total fat content and fatty acid profile of fine-aroma cocoa from northeastern Peru. Front Nutr. 2021;8:677000. pmid:34291070
  27. 27. Castro-Alayo EM, Idrogo-Vásquez G, Siche R, Cardenas-Toro FP. Formation of aromatic compounds precursors during fermentation of Criollo and Forastero cocoa. Heliyon. 2019;5(1):e01157. pmid:30775565
  28. 28. Ordoñez ES, Quispe Y, García LF. Cuantificación de fenoles, antocianinas y caracterización sensorial de nibs y licor de cinco variedades de cacao, en dos sistemas de fermentación. Sci Agropecu. 2020;11(4):473–81.
  29. 29. Oliva-Cruz M, Maicelo-Quintana JL. Identificación y selección de ecotipos de cacao nativo fino de aroma de la zona Nor oriental del Perú. Rev Invest Agrop Sust. 2020;4(2):31–9.
  30. 30. Oliva-Cruz M, Goñas M, García LM, Rabanal-Oyarse R, Alvarado-Chuqui C, Escobedo-Ocampo P, et al. Phenotypic characterization of fine-aroma cocoa from northeastern Peru. Int J Agron. 2021;2021:1–12.
  31. 31. Bustamante DE, Motilal LA, Calderon MS, Mahabir A, Oliva M. Genetic diversity and population structure of fine aroma cacao (Theobroma cacao L.) from north Peru revealed by single nucleotide polymorphism (SNP) markers. Front Ecol Evol. 2022;10:895056.
  32. 32. Micheli F, Guiltinan M, Gramacho KP, Wilkinson MJ, Figueira AV de O, Cascardo JC de M, et al. Functional genomics of Cacao. In: Adv Bot Res. 2010;55:119–77.
  33. 33. Fister AS, Shi Z, Zhang Y, Helliwell EE, Maximova SN, Guiltinan MJ. Protocol: transient expression system for functional genomics in the tropical tree Theobroma cacao L. Plant Methods. 2016;12(1):1–13.
  34. 34. Thiers B. Index Herbariorum: a global directory of public herbaria and associated staff. New York Botanical Garden’s Virtual Herbarium; 2016 [cited 2021]. Available from: http://sweetgum.nybg.org/science/ih
  35. 35. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6. pmid:25609793
  36. 36. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77. pmid:22506599
  37. 37. Jin J-J, Yu W-B, Yang J-B, Song Y, dePamphilis CW, Yi T-S, et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020;21(1):241. pmid:32912315
  38. 38. Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015;31(20):3350–2. pmid:26099265
  39. 39. Jansen RK, Saski C, Lee S-B, Hansen AK, Daniell H. Complete plastid genome sequences of three Rosids (Castanea, Prunus, Theobroma): evidence for at least two independent transfers of rpl22 to the nucleus. Mol Biol Evol. 2011;28(1):835–47. pmid:20935065
  40. 40. Beck N, Lang B. MFannot, organelle genome annotation webserver. Quebec, Canada; Université de Montréal. 2010 [cited 2023 Nov 02]. Available from: https://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl
  41. 41. Lowe TM, Chan PP. tRNAscan-SE on-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res. 2016;44(W1):W54–7. pmid:27174935
  42. 42. Beier S, Thiel T, Münch T, Scholz U, Mascher M. MISA-web: a web server for microsatellite prediction. Bioinformatics. 2017;33(16):2583–5. pmid:28398459
  43. 43. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. pmid:23329690
  44. 44. Ingvarsson PK, Ribstein S, Taylor DR. Molecular evolution of insertions and deletion in the chloroplast genome of silene. Mol Biol Evol. 2003;20(11):1737–40. pmid:12832644
  45. 45. Kane N, Sveinsson S, Dempewolf H, Yang JY, Zhang D, Engels JMM, et al. Ultra-barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. Am J Bot. 2012;99(2):320–9. pmid:22301895
  46. 46. Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol Biol Evol. 2017;34(3):772–3. pmid:28013191
  47. 47. Trifinopoulos J, Nguyen L-T, von Haeseler A, Minh BQ. W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis. Nucleic Acids Res. 2016;44(W1):W232–5. pmid:27084950
  48. 48. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. Corrigendum to: IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(8):2461. pmid:32556291
  49. 49. Paradis E, Schliep K. Ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35(3):526–8. pmid:30016406
  50. 50. RStudio Team. RStudio: integrated development environment for R. 2022. Available from: http://www.rstudio.com/
  51. 51. Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinf. 2018;19(S6).
  52. 52. Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, et al. Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Res. 2008;36(Web Server issue):W465–9. pmid:18424797
  53. 53. Drummond AJ, Rambaut A. BEAST. Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:1–8.
  54. 54. Gernhard T. The conditioned reconstructed process. J Theor Biol. 2008;253(4):769–78. pmid:18538793
  55. 55. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4(5):e88. pmid:16683862
  56. 56. Wang JH, Moore MJ, Wang H, Zhu ZX, Wang HF. Plastome evolution and phylogenetic relationships among Malvaceae subfamilies. Gene. 2021;765:145103. pmid:32889057
  57. 57. Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior summarization in Bayesian phylogenetics using tracer 1.7. Syst Biol. 2018;67(5):901–4. pmid:29718447
  58. 58. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. pmid:22367748
  59. 59. Xu Q, Xiong G, Li P, He F, Huang Y, Wang K, et al. Analysis of complete nucleotide sequences of 12 Gossypium chloroplast genomes: origin and evolution of allotetraploids. PLoS One. 2012;7(8):e37128. pmid:22876273
  60. 60. Cai J, Ma P-F, Li H-T, Li D-Z. Complete Plastid genome sequencing of four Tilia species (Malvaceae): a comparative analysis and phylogenetic implications. PLoS One. 2015;10(11):e0142705. pmid:26566230
  61. 61. Yan L, Wang H, Huang X, Li Y, Yue Y, Wang Z, et al. Chloroplast genomes of genus Tilia: comparative genomics and molecular evolution. Front Genet. 2022;13:925726. pmid:35873491
  62. 62. Cheng Y, Zhang L, Qi J, Zhang L. Complete chloroplast genome sequence of Hibiscus cannabinus and comparative analysis of the Malvaceae family. Front Genet. 2020;11:227. pmid:32256523
  63. 63. Menezes APA, Resende-Moreira LC, Buzatti RSO, Nazareno AG, Carlsen M, Lobo FP, et al. Chloroplast genomes of Byrsonima species (Malpighiaceae): comparative analysis and screening of high divergence sequences. Sci Rep. 2018;8(1):2210. pmid:29396532
  64. 64. Guo D-Q, Li H-L, Liu C, Zhang H, Du H-H, Zhou N. The complete chloroplast genome and phylogenetic analysis of Sida szechuensis matsuda (Malvaceae). Mitochondrial DNA B Resour. 2021;6(11):3146–7. pmid:34746387
  65. 65. Claude S-J, Park S, Park S. Gene loss, genome rearrangement, and accelerated substitution rates in plastid genome of Hypericum ascyron (Hypericaceae). BMC Plant Biol. 2022;22(1):135. pmid:35321651
  66. 66. Park S, Jansen RK, Park S. Complete plastome sequence of Thalictrum coreanum (Ranunculaceae) and transfer of the rpl32 gene to the nucleus in the ancestor of the subfamily Thalictroideae. BMC Plant Biol. 2015;15(1):40. pmid:25652741
  67. 67. Millen RS, Olmstead RG, Adams KL, Palmer JD, Lao NT, Heggie L, et al. Many parallel losses of infA from chloroplast DNA during angiosperm evolution with multiple independent transfers to the nucleus. Plant Cell. 2001;13(3):645–58. pmid:11251102
  68. 68. Hirose T, Ideue T, Wakasugi T, Sugiura M. The chloroplast infA gene with a functional UUG initiation codon. FEBS Lett. 1999;445(1):169–72. pmid:10069394
  69. 69. Zhao F, Li B, Drew BT, Chen Y-P, Wang Q, Yu W-B, et al. Leveraging plastomes for comparative analysis and phylogenomic inference within Scutellarioideae (Lamiaceae). PLoS One. 2020;15(5):e0232602. pmid:32379799
  70. 70. Kyalo CM, Gichira AW, Li Z-Z, Saina JK, Malombe I, Hu G-W, et al. Characterization and comparative analysis of the complete chloroplast genome of the critically endangered species Streptocarpus teitensis (Gesneriaceae). Biomed Res Int. 2018;2018:1–11.
  71. 71. Mustafina FU, Yi D-K, Choi K, Shin CH, Tojibaev KS, Downie SR. A comparative analysis of complete plastid genomes from Prangos fedtschenkoi and Prangos lipskyi (Apiaceae). Ecol Evol. 2019;9(1):364–77. pmid:30680120
  72. 72. Lee C, Ruhlman TA, Jansen RK. Unprecedented intraindividual structural heteroplasmy in Eleocharis (Cyperaceae, Poales) plastomes. Genome Biol Evol. 2020;12(5):641–55. pmid:32282915
  73. 73. Irwin NAT, Pittis AA, Richards TA, Keeling PJ. Systematic evaluation of horizontal gene transfer between eukaryotes and viruses. Nat Microbiol. 2022;7(2):327–36. pmid:34972821
  74. 74. Filée J, Pouget N, Chandler M. Phylogenetic evidence for extensive lateral acquisition of cellular genes by Nucleocytoplasmic large DNA viruses. BMC Evol Biol. 2008;8:320. pmid:19036122
  75. 75. Catoni M, Noris E, Vaira AM, Jonesman T, Matić S, Soleimani R, et al. Virus-mediated export of chromosomal DNA in plants. Nat Commun. 2018;9(1):5308. pmid:30546019
  76. 76. Koonin EV, Krupovic M. The depths of virus exaptation. Curr Opin Virol. 2018;31:1–8. pmid:30071360
  77. 77. Vardi A, Haramaty L, Van Mooy BAS, Fredricks HF, Kimmance SA, Larsen A, et al. Host–virus dynamics and subcellular controls of cell fate in a natural coccolithophore population. Proc Natl Acad Sci USA. 2012;109(47):19327–32. pmid:23134731
  78. 78. Biémont C. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics. 2010;186(4):1085–93.
  79. 79. Muller E, Ullah I, Dunwell JM, Daymond AJ, Richardson M, Allainguillaume J, et al. Identification and distribution of novel badnaviral sequences integrated in the genome of cacao (Theobroma cacao). Sci Rep. 2021;11(1):8270. pmid:33859254
  80. 80. Liu H, Fu Y, Jiang D, Li G, Xie J, Cheng J, et al. Widespread horizontal gene transfer from double-stranded RNA viruses to eukaryotic nuclear genomes. J Virol. 2010;84(22):11876–87. pmid:20810725
  81. 81. Frank JA, Feschotte C. Co-option of endogenous viral sequences for host cell function. Curr Opin Virol. 2017;25:81–9. pmid:28818736
  82. 82. Wang W, Lanfear R. Long-reads reveal that the chloroplast genome exists in two distinct versions in most plants. Genome Biol. Evol. 2019;11(12):3372–81. pmid:31750905
  83. Gabriel A, Willems M, Mules EH, Boeke JD. Replication infidelity during a single cycle of Ty1 retrotransposition. Proc Natl Acad Sci USA. 1996;93(15):7767–71. pmid:8755550
  84. Broz AK, Keene A, Fernandes Gyorfy M, Hodous M, Johnston IG, Sloan DB. Sorting of mitochondrial and plastid heteroplasmy in Arabidopsis is extremely rapid and depends on MSH1 activity. Proc Natl Acad Sci USA. 2022;119(34):e2206973119. pmid:35969753
  85. Chen Z, Grover CE, Li P, Wang Y, Nie H, Zhao Y, et al. Molecular evolution of the plastid genome during diversification of the cotton genus. Mol Phylogenet Evol. 2017;112:268–76. pmid:28414099
  86. Kuhn DN, Figueira A, Lopes U, Motamayor JC, Meerow AW, Cariaga K, et al. Evaluating Theobroma grandiflorum for comparative genomic studies with Theobroma cacao. Tree Genet Genomes. 2010;6(5):783–92.
  87. Gutiérrez-López N, Ovando-Medina I, Salvador-Figueroa M, Molina-Freaner F, Avendaño-Arrazate CH, Vázquez-Ovando A. Unique haplotypes of cacao trees as revealed by trnH-psbA chloroplast DNA. PeerJ. 2016;4:e1855. pmid:27076998
  88. Lachenaud P, Motamayor JC. The Criollo cacao tree (Theobroma cacao L.): a review. Genet Resour Crop Evol. 2017;64(8):1807–20.
  89. Dong W, Cheng T, Li C, Xu C, Long P, Chen C, et al. Discriminating plants using the DNA barcode rbcLb: an appraisal based on a large data set. Mol Ecol Resour. 2014;14(2):336–43. pmid:24119263
  90. CBOL Plant Working Group. A DNA barcode for land plants. Proc Natl Acad Sci USA. 2009;106(31):12794–7.
  91. Clement WL, Donoghue MJ. Barcoding success as a function of phylogenetic relatedness in Viburnum, a clade of woody angiosperms. BMC Evol Biol. 2012;12(1):73.
  92. Jin WT, Schuiteman A, Chase MW, Li JW, Chung SW, Hsu TC, et al. Phylogenetics of subtribe Orchidinae s.l. (Orchidaceae; Orchidoideae) based on seven markers (plastid matK, psaB, rbcL, trnL-F, trnH-psbA, and nuclear nrITS, Xdh): implications for generic delimitation. BMC Plant Biol. 2017;17(1):1–14.
  93. Tineo D, Bustamante DE, Calderon MS, Mendoza JE, Huaman E, Oliva M. An integrative approach reveals five new species of highland papayas (Caricaceae, Vasconcellea) from northern Peru. PLoS One. 2020;15(12):e0242469. pmid:33301452
  94. Caddah MK, Meirelles J, Nery EK, Lima DF, Nicolas AN, Michelangeli FA, et al. Beneath a hairy problem: phylogeny, morphology, and biogeography circumscribe the new Miconia supersection Discolores (Melastomataceae: Miconieae). Mol Phylogenet Evol. 2022;171:107461. pmid:35351631
  95. Zhang GL, Feng C, Kou J, Han Y, Zhang Y, Xiao HX. Phylogeny and divergence time estimation of the genus Didymodon (Pottiaceae) based on nuclear and chloroplast markers. J Syst Evol. 2023;61(1):115–26.
  96. Whitlock BA, Hale AM, Groff PA. Intraspecific inversions pose a challenge for the trnH-psbA plant DNA barcode. PLoS One. 2010;5(7):e11533. pmid:20644717
  97. Pang X, Liu C, Shi L, Liu R, Liang D, Li H, et al. Utility of the trnH-psbA intergenic spacer region and its combinations as plant DNA barcodes: a meta-analysis. PLoS One. 2012;7(11):e48833. pmid:23155412
  98. Amar MH. ycf1-ndhF genes, the most promising plastid genomic barcode, sheds light on phylogeny at low taxonomic levels in Prunus persica. J Genet Eng Biotechnol. 2020;18(1):42. pmid:32797323
  99. Parks M, Cronn R, Liston A. Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biol. 2009;7(1):84. pmid:19954512
  100. Li H, Xiao W, Tong T, Li Y, Zhang M, Lin X, et al. The specific DNA barcodes based on chloroplast genes for species identification of Orchidaceae plants. Sci Rep. 2021;11(1):1424. pmid:33446865
  101. Drew BT, Sytsma KJ. The South American radiation of Lepechinia (Lamiaceae): phylogenetics, divergence times and evolution of dioecy. Bot J Linn Soc. 2013;171:171–90.
  102. Dong W, Liu J, Yu J, Wang L, Zhou S. Highly variable chloroplast markers for evaluating plant phylogeny at low taxonomic levels and for DNA barcoding. PLoS One. 2012;7(4):e35071. pmid:22511980
  103. Wicke S, Schneeweiss GM, dePamphilis CW, Müller KF, Quandt D. The evolution of the plastid chromosome in land plants: gene content, gene order, gene function. Plant Mol Biol. 2011;76(3-5):273–97. pmid:21424877
  104. Guisinger MM, Chumley TW, Kuehl JV, Boore JL, Jansen RK. Implications of the plastid genome sequence of Typha (Typhaceae, Poales) for understanding genome evolution in Poaceae. J Mol Evol. 2010;70(2):149–66. pmid:20091301
  105. Richardson JE, Whitlock BA, Meerow AW, Madriñán S. The age of chocolate: a diversification history of Theobroma and Malvaceae. Front Ecol Evol. 2015;3(120):1–14.