The genus Malus belongs to the family Rosaceae. This family includes several important genera that account for most of our important deciduous fruit crops including apple (Malus), pear (Pyrus), and stone fruits (Prunus) such as peach [Prunus persica (L.) Batsch], cherry [Prunus avium (L.) L.], plum (Prunus domestica L.), apricot (Prunus armeniaca L.), almond [Prunus dulcis (Mill.) D.A. Webb], as well as other valuable ornamental plants including rose (Rosa), medlar (Mespilus), and hawthorn (Crataegus), among others (Challice, 1974). Among these various genera, Malus serves as the most commercially valuable.
Most cultivated apples are diploids (2n = 34), self-incompatible, and display juvenile periods of 6 to 10 yr or more. The apple has a relatively small genome, 1.54 pg DNA/2C or 750 Mb per haploid genome, which is similar to that of the sorghum [Sorghum bicolor (L.) Moench] genome and about the same size as the tomato (Solanum lycopersicum L.) genome (Arumuganathan and Earle, 1991; Tatum et al., 2005). Molecular mapping studies of the apple have been underway with over 1200 isozymes, random amplified polymorphic DNAs, restriction fragment length polymorphisms, amplified fragment length polymorphisms, simple sequence repeats (SSRs), expressed sequence tag (EST)–SSRs, and single nucleotide polymorphisms (SNPs) mapped to different linkage groups (Maliepaard et al., 1998; Liebhard et al., 2003; Naik et al., 2006; Chagné et al., 2008; Celton et al., 2009). At least three bacterial artificial chromosome (BAC) libraries have been constructed for the apple from different genotypes (Vinatzer et al., 2001; Xu et al., 2001, 2002). These BAC libraries have been successfully used for cloning genes of interest (Xu and Korban, 2002; Han et al., 2007). Several apple cultivars have been transformed via Agrobacterium-mediated transformation (James et al., 1989; Yao et al., 1995), and promising transgenic lines with enhanced resistance to important diseases such as apple scab [Venturia inaequalis (Cke.) Wint.] and fire blight [Erwinia amylovora (Burrill) Winslow] have been developed (Aldwinckle et al., 2003; Belfanti et al., 2004; Malnoy et al., 2008). Thus, the apple serves as an ideal model system for all members of the Rosaceae family and is primed to benefit from research efforts in functional genomics.
Large-scale single-pass sequencing of cDNA clones, randomly picked from cDNA libraries, is a very powerful approach for gene discovery and provides an overview of transcriptional activities within tissues (Adams et al., 1993). Expressed sequence tags are the most widely sequenced nucleotide commodities from plant genomes, as they provide robust sequence resources that can be exploited for gene discovery, genome annotation, and comparative genomics (Arabidopsis Genome Initiative, 2000). Since the completion of sequencing of the Arabidopsis genome (Arabidopsis Genome Initiative, 2000), widespread major genomic efforts have been underway for various other plant species (Shoemaker et al., 2002; Van der Hoeven et al., 2002; Forment et al., 2005; Horn et al., 2005; International Rice Genome Project, 2005; Moser et al., 2005; Newcomb et al., 2006; Tuskan et al., 2006). The number of publicly available ESTs from plant species is growing, and many of the sequencing projects are focused on crop species. Recently, EST sequencing has been reported for fruit species, such as tomato (Van der Hoeven et al., 2002), citrus (Forment et al., 2005), peach (Horn et al., 2005), and apple (Newcomb et al., 2006), as has the use of EST data for functional (Sterky et al., 2004; Park et al., 2006) and comparative (Fulton et al., 2002; Albert et al., 2005; Brenner et al., 2005) genomics studies. Additionally, EST data have demonstrated their usefulness in enhancing the utility of genetic maps (Naik et al., 2006; Chagné et al., 2008; Shulaev et al., 2008).
In this study, we report on collection and analysis of 182,241 high-quality apple ESTs from different tissues, under different conditions, and from different genotypes. This has vastly expanded on the apple EST database previously reported by Newcomb et al. (2006) with additional tissues, treatments, and genotypes. We have used computational comparisons against Arabidopsis genomic sequences to functionally annotate apple sequences. Comparisons of apple to Arabidopsis and to other tree species, such as citrus and poplar, have revealed a set of genes most likely associated with tree formation. Moreover, this has provided a global overview of the extent to which genes have diverged since apple, Arabidopsis, citrus, and poplar have undergone divergence from their last common ancestor.
Materials and Methods
cDNA libraries were constructed from different tissues, both vegetative and reproductive tissues, and under different biotic and abiotic stresses, of nine apple cultivars, including GoldRush, Royal Gala, Fuji, Braeburn, Suncrisp, Granny Smith, Red Delicious, Jonagold, and Wijcik; three apple rootstocks, including M.9, M.111, and Geneva 3041; and one interspecific hybrid, M. × domestica cv. Geneva 3041 × M. sieversii (Ledeb.) M. Roem. (Table 1). Tissues used for library construction were collected from trees grown at the University of Illinois, Urbana, or from greenhouse-grown plants subjected to various biotic (Cornell University and USDA-ARS, Geneva, NY) or abiotic (USDA-ARS, Kearneysville, WV) stresses.
|Library code||Source||Library strategy||Apple cultivar|
|Mdas||Leaf tissue challenged with Venturia inaequalis||Primary||M. × domestica cv. GoldRush|
|Mdbd||Mixed-bud stages||Normalized||M. × domestica cv. GoldRush|
|Mdfb||Leaf tissue challenged with Erwinia amylovora||Primary||M. × domestica cv. Red Delicious|
|Mdfr||Fruit – 9 DAP||Primary||M. × domestica cv. GoldRush|
|Mdfbg||Leaf tissue challenged with Erwinia amylovora||Primary||Apple rootstock Geneva 3041|
|Mdfrb||Fruit – 36 DAP||Primary||M. × domestica cv. Braeburn|
|Mdfrf||Fruit – 36 DAP||Primary||M. × domestica cv. Fuji|
|Mdfrg||Fruit – 36 DAP||Primary||M. × domestica cv. Granny Smith|
|Mdfrj||Fruit – 36 DAP||Primary||M. × domestica cv. Jonagold|
|Mdfrs||Fruit – 36 DAP||Primary||M. × domestica cv. Suncrisp|
|Mdfrt||Mixed-fruit stages||Normalized||M. × domestica cv. GoldRush|
|Mdfw||Mixed-floral stages||Normalized||M. × domestica cv. GoldRush|
|Mdfwb||Flower balloon stage||Primary||M. × domestica cv. Braeburn|
|Mdfwf||Flower balloon stage||Primary||M. × domestica cv. Fuji|
|Mdfwg||Flower balloon stage||Primary||M. × domestica cv. Granny Smith|
|Mdfwj||Flower balloon stage||Primary||M. × domestica cv. Jonagold|
|Mdfws||Flower balloon stage||Primary||M. × domestica cv. Suncrisp|
|Mdlr||Leaf challenged with leaf roller insect||Primary||M. × domestica cv. GoldRush|
|Mdltb||Bud tissue exposed to low temperature||Primary||M. × domestica cv. Royal Gala|
|Mdltl||Leaf tissue exposed to low temperature||Primary||M. × domestica cv. Royal Gala|
|Mdltx||Xylem exposed to low temperature||Primary||M. × domestica cv. Royal Gala|
|Mdlv||Leaf – Stage I||Primary||M. × domestica cv. GoldRush|
|Mdlv2||Leaf – Stage II||Primary||M. × domestica cv. GoldRush|
|Mdlv3||Leaf – Stage III||Primary||M. × domestica cv. GoldRush|
|Mdlv4||Leaf – Stage IV||Primary||M. × domestica cv. GoldRush|
|Mdrta||Root tissue||Primary||Apple rootstock M.9|
|Mdrtb||Root tissue||Primary||Apple rootstock M.111|
|Mdrtc||Root tissue||Primary||Apple rootstock Geneva 3041|
|Mdrtp||Root tissue challenged with Phytophthora cactorum||Primary||M. sieversii × Geneva 3041|
|Mdst||Mixed-shoot stages||Normalized||M. × domestica cv. GoldRush|
|Mdstw||Actively growing shoot||Primary||M. × domestica cv. Wijcik|
|Mdwdb||Bud tissue exposed to water deficit||Primary||M. × domestica cv. Royal Gala|
|Mdwdl||Leaf tissue exposed to water deficit||Primary||M. × domestica cv. Royal Gala|
|Mdwdr||Root tissue exposed to water deficit||Primary||M. × domestica cv. Royal Gala|
Synthesis of cDNA Libraries
Apple primary cDNA libraries were constructed using two different approaches, and normalization was conducted for four of these libraries.
For the first approach, used for constructing 23 libraries (Table 1), the following steps were used. Total RNA was extracted from frozen tissues using a modified cetyltrimethyl ammonium bromide method (Gasic et al., 2005). Poly(A)+ mRNA was isolated twice from total RNA from each stage using the Oligotex Direct mRNA kit (Qiagen, Valencia, CA). mRNA was reverse-transcribed into double-stranded cDNA using a modified oligo18(dT) primer with an identifying tag sequence (Table 2). For those four libraries that were normalized, cDNAs from different stages were pooled in equal amounts before adaptor ligation. cDNA libraries were constructed following procedures described in Bonaldo et al. (1996). Double-stranded cDNAs were size-selected to enrich for molecules >500 bp, ligated to EcoRI adapters (Promega, Madison, WI) at both ends, and then digested with NotI. The cDNAs were then directionally cloned into EcoRI (5′)–NotI (3′) digested pBluescript II SK(+) phagemid vector (Stratagene, Cedar Creek, TX), and electroporated into ElectroMax DH10B cells (Invitrogen Life Technologies, Carlsbad, CA) to generate the primary library. Four libraries were further normalized following the procedure described by Soares et al. (1994). Purified plasmid DNA from the primary library was converted to single-stranded circles and used as a template for polymerase chain reaction (PCR) amplification using the T7 and T3 priming sites flanking cloned cDNA inserts. Purified PCR products, representing the entire cloned cDNA population, were used as drivers for normalization. Hybridization between a single-stranded library and PCR products was performed for 44 h at 30°C. Unhybridized single-stranded DNA circles were hydroxyapatite-purified from hybridized DNA rendered partially double-stranded, converted to double-stranded DNA, and electroporated into ElectroMax DH10B cells (Invitrogen) to generate the normalized library.
|Tag sequence||Tag identification from 5′ end||Tag identification from 3′ end|
|A||Insert 18(A)TCGTG||CACGA18(T) insert|
|B||Insert 18(A)TGCTG||CAGCA18(T) insert|
|I||Insert 18(A)TCGGT||ACCGA18(T) insert|
|J||Insert 18(A)TGCGA||TCGCA18(T) insert|
|K||Insert 18(A)TCGGA||TCCGA18(T) insert|
|H||Insert 18(A)TGCGT||ACGCA18(T) insert|
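The tag layout in Table 2 lends itself to a simple check: for each tag, the signature seen from the 3′ end is the reverse complement of the pentamer seen from the 5′ end, followed by the 18-T stretch. The Python sketch below illustrates how a 3′-end read might be assigned to a developmental-stage tag under that layout. The function names, the `max_offset` search window, and the example read are hypothetical and are not part of the published pipeline; real reads would likely need tolerance for sequencing errors and variable poly(T) length.

```python
# Tag pentamers as read from the 5' end, copied from Table 2.
TAGS_5PRIME = {
    "A": "TCGTG", "B": "TGCTG", "I": "TCGGT",
    "J": "TGCGA", "K": "TCGGA", "H": "TGCGT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Signature expected near the start of a 3'-end read:
# reverse complement of the tag followed by 18 T residues.
SIGNATURES_3PRIME = {
    name: reverse_complement(tag) + "T" * 18
    for name, tag in TAGS_5PRIME.items()
}


def identify_tag(read_3prime: str, max_offset: int = 10):
    """Return the tag letter whose signature starts within the first
    max_offset bases of the read, or None if no tag is found.
    max_offset is an illustrative assumption, not a published value."""
    for name, signature in SIGNATURES_3PRIME.items():
        position = read_3prime.find(signature)
        if 0 <= position <= max_offset:
            return name
    return None


# Synthetic example read carrying tag B.
example_read = "GG" + "CAGCA" + "T" * 18 + "ACGTACGT"
assigned = identify_tag(example_read)
```

Deriving the 3′ signatures from the 5′ pentamers, instead of typing both columns by hand, keeps the two views of Table 2 consistent by construction.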
A second approach was used for constructing an additional 11 primary cDNA libraries (Table 1). Total RNA was extracted from freeze-dried tissue using either a standard phenol–chloroform extraction method (leaf tissue stage 1) or a modified method (all other tissues) as described by Wang and Vodkin (1994). Poly(A)+ mRNA (4–6 μg) was isolated from 1.0 mg of total RNA using the PolyATtract mRNA Isolation System III (Promega).
Libraries from leaf tissue stage 1 and fruit tissue 9 DAP were constructed using the following methodology: complementary DNA was synthesized from mRNA using an anchored Poly (dT) sequence with a NotI restriction site. SalI linker adapters were ligated to blunt-ended cDNA fragments followed by restriction with NotI. The cDNA fragments were directionally cloned into the NotI–SalI restriction site of the pSPORT 1 vector (Invitrogen). Libraries from leaf tissue stages 2 and 3 were prepared by synthesizing complementary DNA using a hybrid oligo (dT) linker-primer containing an XhoI restriction site. EcoRI adapters were ligated to blunt-ended cDNA fragments, followed by restriction with XhoI. cDNA inserts were protected from XhoI digestion via methylation during the first-strand cDNA synthesis. cDNA fragments were directionally cloned into the EcoRI–XhoI restriction site of the pBluescript II SK(+) XR vector (Stratagene). Ligated cDNA fragments from the two different methodologies were transformed into Escherichia coli ElectroMax DH10B host cells.
Clone Preparation and Sequencing
To confirm presence and size of inserts in both primary and normalized libraries, 192 white colonies per library were randomly hand-picked and grown overnight in 200 μL of YT media supplemented with ampicillin and glycerol in two 96-well plates. A PCR reaction was performed with M13 universal forward and reverse primers to amplify cloned inserts. Agarose gel analysis (1%) of PCR products confirmed presence and average size of cloned fragments (Table 3). Plates were submitted to the W.M. Keck Sequencing Center (University of Illinois at Urbana-Champaign) to verify the quality of cloned fragments by sequencing.
|Sequence assembly||No. of sequences||Avg. length|
|Total high-quality sequences||182,241|
|ESTs in contigs||172,398||ND|
|Total no. of contigs||23,442||865|
|Total no. of apple unique sequences (unigenes)||33,285|
|Number of assembled sequences matching known genes||26,333||ND|
|Number of sequences specific to apple||6,952||ND|
|Avg. insert size||1500|
|Avg. sequence size||465|
All cDNA libraries were then plated, and individual colonies were picked robotically and assigned unique identifiers. Glycerol stocks of cDNA clones for 5′ end-sequencing were sent in 384-well format to the Genome Sequencing Center (Washington University, St. Louis, MO). Clones were then transferred into 96-well blocks and incubated at 37°C for 24 h while shaking at 25 × g (297 rpm) in an incubator shaker. Clones were processed according to Marra et al. (1999) using a high-throughput 96-well microwave protocol. Dideoxy terminator sequencing reactions were conducted as described by Hillier et al. (2006).
Contig assembly was done on “clean” sequences having minimum lengths of 100 nucleotides and minimum quality scores of 20, following vector trimming and discarding of low-quality sequences. The final clean sequences were used for clustering and assembly using Paracel Transcript Assembler (Paracel, Inc., Pasadena, CA). Contaminant sequences like E. coli, mitochondrial, chloroplast, cloning vector, and RNA were filtered during the cleanup stage. Repeat sequences were masked and annotated. The EST sequences were then clustered based on local similarity scores of pairwise comparisons using 88% similarity over 100 nucleotides. Clusters containing only a single sequence were grouped as singlets. The EST clusters were assembled into tentative contigs (contiguous sequence) by multiple-sequence alignment, generating a consensus sequence for each cluster, with criteria of 95% identity and 30-nucleotide overlap. As EST clusters might not share enough similarity over their entire length to be assembled into a single contig, multiple contigs might be generated per cluster. Moreover, multiple contigs might also be generated when ESTs within a cluster represent an alternative splice form of the gene. Those ESTs remaining in a cluster following formation of contigs were designated as cluster_singlets. Unique sequences for each library included contigs, cluster_singlets, and singlets.
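The numeric thresholds described above can be summarized in a short Python sketch. It is only an illustration of the stated criteria and does not reproduce Paracel Transcript Assembler. The `Read` class and function names are hypothetical, and treating the "minimum quality score of 20" as a mean Phred score per read is an assumption, since the text does not say how quality was aggregated.

```python
from dataclasses import dataclass
from typing import List

MIN_LENGTH = 100   # minimum clean read length (nucleotides), from the text
MIN_QUALITY = 20   # minimum quality score, from the text


@dataclass
class Read:
    name: str
    sequence: str
    qualities: List[int]  # per-base Phred scores (assumed representation)


def is_clean(read: Read) -> bool:
    """Keep a trimmed read only if it is long enough and, under the
    mean-quality assumption, of sufficient quality."""
    if len(read.sequence) < MIN_LENGTH or not read.qualities:
        return False
    mean_quality = sum(read.qualities) / len(read.qualities)
    return mean_quality >= MIN_QUALITY


def passes_clustering(percent_similarity: float, aligned_length: int) -> bool:
    """Clustering criterion: at least 88% similarity over 100 nucleotides."""
    return percent_similarity >= 88.0 and aligned_length >= 100


def passes_assembly(percent_identity: float, overlap_length: int) -> bool:
    """Contig criterion: at least 95% identity over a 30-nucleotide overlap."""
    return percent_identity >= 95.0 and overlap_length >= 30


# Synthetic reads for illustration only.
reads = [
    Read("r1", "A" * 120, [30] * 120),  # passes both filters
    Read("r2", "A" * 80, [30] * 80),    # too short
    Read("r3", "A" * 120, [10] * 120),  # quality too low
]
clean_names = [r.name for r in reads if is_clean(r)]
```

Because the clustering threshold (88%) is looser than the assembly threshold (95%), a single cluster can yield several contigs plus leftover cluster_singlets, as the text describes.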
Sequence Analysis and Annotation
Putative functions of the apple unique ESTs were classified according to the Gene Ontology (GO) Consortium (2001) scheme. The representation of protein families, domains, and functional sites within apple unique sequences was determined using InterProScan (EMBL, Cambridge, UK). Subsets of apple unique sequences, of at least 300 bp in length, were additionally cataloged into 22 functional categories based on similarity to Arabidopsis [Arabidopsis thaliana (L.) Heynh.] proteins and functional annotation available for Arabidopsis proteins following Munich Information Center for Protein Sequences (MIPS) (http://mips.gsf.de [verified 22 Jan. 2009]) Functional Catalogue (FunCat) schema (Ruepp et al., 2004).
Data Sets Used for Analyses
The apple (M. × domestica Borkh.) unigene set, and the ESTs used for this unigene set build are available on the Apple EST project Web site (http://titan.biotec.uiuc.edu/apple/apple [verified 22 Jan. 2009]) under the Search ESTs (Final Assembly) link. The Arabidopsis thaliana protein sequences are available through the Arabidopsis Information Resource (http://www.arabidopsis.org [verified 22 Jan. 2009]). All UniGene data sets for citrus, grape (Vitis vinifera L.), pine (Pinus taeda L.), poplar, soybean (Glycine max L.), and tomato are available on National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene [verified 22 Jan. 2009]). Oryza sativa L. protein sequences and the nonredundant protein database are also available on NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&cmd [verified 22 Jan. 2009]) and the poplar protein sequences are available at JGI Populus trichocarpa v1.1 (http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html [verified 22 Jan. 2009]).
Synthesis and Sequencing from Normalized and Primary Apple cDNA Libraries
A total of 34 directionally cloned apple cDNA libraries were constructed from six different tissues (shoot, bud, leaf, flower, fruit, and root); six treatments (biotic and abiotic stresses); and 13 genotypes, including nine cultivars (Braeburn, Fuji, GoldRush, Granny Smith, Jonagold, Red Delicious, Royal Gala, Suncrisp, and Wijcik); three rootstocks (M.9, M.111, and Geneva 3041); and one interspecific hybrid (M. × domestica cv. Geneva 3041 × M. sieversii) (Table 1). Most libraries (11) were constructed from different tissues of the cultivar GoldRush, representing 67% of the total ESTs. Both primary and normalized cDNA libraries were constructed from GoldRush. Seven primary libraries, corresponding to four developmental stages of apple leaf, first developmental stage of apple fruit, and two pest-challenged (apple scab and obliquebanded leafroller [Choristoneura rosaceana (Harris)]) leaf tissues, were constructed. Furthermore, four normalized libraries from flower, fruit, bud, and stem apple tissues were constructed. Each normalized library, encompassing three to six developmental stages, was evenly represented, and 3′-labeled with an identifying tag sequence (Tables 1 and 2). To increase the likelihood of SNP detection, 10 additional primary libraries from flower (balloon stage) and fruit (9 d after pollination) were constructed from five apple cultivars, including Braeburn, Fuji, Granny Smith, Jonagold, and Suncrisp, and ∼6,000 clones were sequenced from each library (Table 1). Sequences from root tissue of apple rootstocks and tissues subjected to the two treatments, mentioned above, contributed 6% each to the total apple EST pool (Table 1).
From these libraries, 190,425 clones were partially sequenced from the 5′ end and 42,619 clones were sequenced from the 3′ end. A total of 182,241 high-quality ESTs, of at least 100 bp in size and with an average clean sequence length of 465 bp, were obtained (Table 3). These ESTs assembled into 23,442 tentative consensus sequences and 9843 singletons, thus representing 33,285 apple unique sequences (Table 3). The contig sequence length ranged from 101 to 4043 bp with an average of 865 bp, while the singleton length ranged from 100 to 1344 bp with an average of 441 bp (Fig. 1).
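The assembly totals reported here and in Table 3 can be cross-checked with simple arithmetic. The Python sketch below uses only figures quoted in the text; the variable names are illustrative, and the mean number of ESTs per contig is a derived value not stated in the source.

```python
# Figures quoted in the text and in Table 3.
total_high_quality_ests = 182_241
contigs = 23_442
singletons = 9_843

# Unique sequences (unigenes) are contigs plus singletons.
unigenes = contigs + singletons

# Every EST that is not a singleton was placed in a contig.
ests_in_contigs = total_high_quality_ests - singletons

# Derived value (not reported in the source): average contig depth.
mean_ests_per_contig = ests_in_contigs / contigs
```

The two derived totals, 33,285 unigenes and 172,398 ESTs in contigs, agree with the values listed in Table 3, and the average contig is built from roughly seven ESTs.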
The largest number of ESTs sequenced from a single library originated from the normalized fruit library (26,917), representing cDNAs from six stages of fruit development. This was followed by the normalized flower library (22,237), representing cDNAs from four stages of flower development.
Functional Annotation of an Apple Unigene Set
Annotation of the EST-derived apple unigene set was performed on the basis of the existing annotation available for the proteome of Arabidopsis. tBLASTX was used to screen the entire apple unigene set against a subset of the Arabidopsis proteome to which functional categories have been assigned (Arabidopsis Genome Initiative, 2000) (http://www.arabidopsis.org). Apple unigenes with expected values (E-values) of E ≤ 1.0 × 10⁻⁵ were assigned to corresponding Arabidopsis annotation. This approach is based on the assumption that functionality is transferable, based on sequence conservation, for which there are many exceptions. These annotations followed the gene ontology (GO) vocabularies (Gene Ontology Consortium, 2001) (www.geneontology.org [verified 22 Jan. 2009]) as well as The Arabidopsis Information Resource Arabidopsis anatomy and developmental stage ontologies (Berardini et al., 2004). The GO terms were organized into three categories representing molecular functions, biological processes, and cellular components (Gene Ontology Consortium, 2001). The sum of apple unigenes per category did not add up to 100%, as some apple unigenes were classified into more than one category.
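The annotation-transfer rule described above (assign an apple unigene the annotation of its Arabidopsis match when the E-value is at or below 1.0 × 10⁻⁵) can be sketched in Python as follows. This is an illustration of the stated rule, not the authors' actual scripts. The `Hit` tuple, the function names, the sequence identifiers, and the choice to keep only the single best hit per unigene are all assumptions made for the example.

```python
from collections import namedtuple

# Minimal stand-in for one similarity-search result (hypothetical layout).
Hit = namedtuple("Hit", ["query", "subject", "evalue"])

E_CUTOFF = 1.0e-5  # threshold stated in the text


def best_hits(hits, cutoff=E_CUTOFF):
    """Keep, for each query, the hit with the lowest E-value that
    passes the cutoff. Using only the best hit is an assumption."""
    best = {}
    for hit in hits:
        if hit.evalue > cutoff:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best


def transfer_annotation(hits, subject_annotation, cutoff=E_CUTOFF):
    """Map each query to the annotation terms of its best subject."""
    return {
        query: subject_annotation[hit.subject]
        for query, hit in best_hits(hits, cutoff).items()
        if hit.subject in subject_annotation
    }


# Placeholder identifiers and terms for illustration only.
hits = [
    Hit("unigene_1", "ath_protein_1", 1e-40),
    Hit("unigene_1", "ath_protein_2", 1e-12),
    Hit("unigene_2", "ath_protein_2", 1e-3),   # fails the cutoff
]
subject_annotation = {
    "ath_protein_1": ["catalytic activity"],
    "ath_protein_2": ["binding", "transporter activity"],
}
annotated = transfer_annotation(hits, subject_annotation)
```

Because a subject protein can carry several terms, one unigene may be counted in more than one category, which is why per-category percentages need not sum to 100%.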
Evaluation of the primary BLAST matches revealed the presence of two major groups of apple unigene sequences with various potentials for predicting their cellular functions. The first group consisted of 26,333 apple unigene sequences (79% of the total unigene set) matching sequences of known proteins (E < 1.0 × 10⁻⁵); these were likely to be transcripts of genes having similar functions (Table 3). The GO vocabulary was used to assign functions to this group. The second group consisted of 6952 apple unigene sequences, accounting for 21% of the total apple unigene set, with no matches in the GenBank database. These were deemed to be apple specific or genes most likely associated with tree formation (Table 3).
Of the total apple unigene set, 13,980 (42%) unigenes were annotated into the Molecular Function GO category (describing the biochemical activity performed by the gene product), 10,694 (32%) unigenes into the Biological Process GO category (describing the ordered assembly of more than one molecular function), and 13,338 (40%) into the Cellular Component GO category (describing subcellular compartments of a cell) (Fig. 2A). Among the molecular functions, the most highly represented categories were the catalytic activity (GO:0003824) (53%), binding (GO:0005488) (49%), transporter activity (GO:0005215) (9%), and transcription regulation activity (GO:0030528) (9%) (Fig. 2B). Among the biological processes, the largest proportion of functionally assigned unigenes fell into the metabolic (GO:0008152) and cellular processes (GO:0009987), 75% each, while localization (GO:0051179), biological regulation (GO:0065007), and response to stimulus (GO:0050896) comprised 14, 12, and 12% of the unigenes, respectively (Fig. 2C). For the cell component category, almost all unigene sequences were annotated into cell (GO:0005623)–cell part (GO:0044464) subcategory (99%), and 38% of these were annotated to membrane (GO:0016020) and 64% to organelle (GO:0043226) subcategories (Fig. 2D). Together, all three GO categories accounted for ∼79% of the assigned apple unigene set.
In addition, a subset of 1982 apple unique sequences (of at least 300 bp in length) was translated into 23,076 open reading frames ≥50 amino acids and assigned to 22 functional categories based on functional annotations available for Arabidopsis proteins following the MIPS (http://mips.gsf.de) FunCat schema (Ruepp et al., 2004). Only 10% of apple sequences from the subset did not have a match in Arabidopsis. Nearly half (44.79%) of those sequences that have matches are similar to unclassified proteins in Arabidopsis. Of classified proteins in the apple subset, the Metabolism category contained the highest number of genes (7.61%), and this concurred with previous observations for apple and Arabidopsis (Newcomb et al., 2006) (Table 4).
|No.||Functional category||Apple unique sequences|
|10||Cell cycle and DNA processing||1.34|
|16||Protein with binding function or cofactor requirement||2.77|
|18||Protein activity regulation||0.22|
|30||Cellular communication/signal transduction mechanism||4.11|
|32||Cell rescue, defense, and virulence||2.77|
|34||Interaction with the cellular environment||0.43|
|36||Interaction with the environment||0.82|
|42||Biogenesis of cellular components||4.15|
|73||Cell type localization||0.09|
|98||Classification not yet clear-cut||5.71|
Comparison of apple unique sequences with InterPro protein family database (Zdobnov and Apweiler, 2001; Mulder et al., 2007), performed to determine representation of protein families, domains, and functional sites, revealed matches to 2425 InterPro families. The InterPro families with the most frequent representation in the apple unique sequences data set are presented in Table 5.
|IPR001245||Tyr protein kinase||297|
|IPR002290||Ser-Thr protein kinase||257|
|IPR000504||RNA recognition motif||250|
|IPR008271||Ser-Thr protein kinase, active site||227|
|IPR001680||G-protein β WD-40 repeat||196|
|IPR000504||RNA-binding region RNP-1 (RNA recognition motif)||186|
|IPR001841||Zinc finger, RING||173|
|IPR014778||MYB DNA-binding domain||104|
|IPR001471||Pathogenesis-related transcriptional factor and ERF||87|
|IPR001344||Chlorophyll a/b-binding protein||86|
|IPR013753||Ras GTPase superfamily||86|
|IPR000571||Zinc finger, C-x8-C-x5-C-x3-H type||80|
|IPR001993||Mitochondrial substrate carrier||74|
|IPR001878||Zinc finger, CCHC type||70|
|IPR014045||Protein phosphatase 2C,N-terminal||62|
|IPR001087||Lipolytic enzyme, G-D-S-L||59|
|IPR002016||Haem peroxidase, plant/fungal/bacterial||59|
|IPR001092||Basic helix-loop-helix (bHLH) dimerization domain bHLH||57|
|IPR001623||Heat-shock protein DnaJ, N terminus||57|
|IPR006121||Heavy metal transport/detoxification protein||56|
|IPR013057||Amino acid transporter, transmembrane||53|
|IPR007087||Zinc finger, C2H2 type||49|
|IPR000916||Bet v I allergen||46|
|IPR013126||Heat-shock protein 70||40|
|IPR000425||Major intrinsic protein||38|
|IPR002130||Peptidyl-prolyl cis-trans isomerase, cyclophilin type||33|
|IPR007493||Protein of unknown function DUF538||28|
|IPR001938||Thaumatin, pathogenesis related||24|
Protein kinases (IPR000719), with 1040 apple unigenes, are the most abundant family. Detailed analysis of transcription factors, via automated predictions based on comparisons to the InterPro database, identified the MYB transcription factor family as the most common in apple sequences (Table 6).
|Top 10 TF family descriptions||No. apple unigene sequences||InterPro accession nos.||TF family rank|
|MYB||228||IPR014778||1 (1)||1, 11, 14||1, 9|
|Pathogenesis related||87||IPR001471||2 (2)||2||2|
|C2H2 Zn finger||52||IPR007087||3 (3)||6||7, 8, 10|
|C2C2 Zn finger||62||IPR000315||5 (5)||5||3|
|Basic helix-loop-helix||61||IPR001092||6 (7)||3||ND|
|C3H-type 1 Zn finger||43||IPR000571||7 (8)||18||ND|
|Total no. TFs||1091||–||1470||1306|
Comparisons of the Apple Unigene Set with Those of Other Plant Species
Large-scale EST and genomic sequences databases from multiple plant organisms are now available in public databases. With available Unigene or Proteome databases for multiple plant species, it is possible to explore relationships and differences among these species. Therefore, the apple unigene set has been compared with available Unigene or Proteome collections.
Proteome collections from seven angiosperm species (Tree of Life [ToL]; http://tolweb.org/tree/phylogeny.html [verified 22 Jan. 2009]) were subjected to tBLASTX to identify sequences encoding similar proteins (Table 7). Six of these species are eudicots (clade rosids: soybean, poplar, Arabidopsis, and citrus; clade Vitaceae: grape; clade asterids: tomato), and one is a monocot (rice). In addition, apple sequences were compared to those of pine, the tree species most distantly related to apple among those examined, and to the nonredundant (nr) proteome database. As database collections for most of the selected plant species are continuously expanding, estimations of similarities and/or differences presented herein should be considered tentative.
|Species||Database||No. of hits||Similarity|
The percentages of apple sequences that did not show any similarities to other databases varied between 21% in the nr and 60% in the citrus database. The poplar predicted proteome, along with Arabidopsis and rice proteome databases, produced the highest numbers of matches with the apple unigene set, 77, 75, and 71%, respectively (Table 7). The observed high level of sequence similarity between apple and each of poplar and Arabidopsis is in agreement with their position on the ToL. Interestingly, comparable levels of similarity, 40 to 56%, were observed between apple sequences and those of other plant species, such as citrus, tomato, pine, soybean, and grape, and irrespective of their phylogenetic relationships to apple (Albert et al., 2005). This observation was greatly influenced by the size of the database as well as the type of sampled tissue for the given plant species at the time of comparison.
With the availability of the complete genome sequence of Arabidopsis along with other ongoing major genomic efforts for various plant species (Arabidopsis Genome Initiative, 2000; Van der Hoeven et al., 2002; Brenner et al., 2005; Forment et al., 2005; International Rice Genome Project, 2005; Moser et al., 2005; Velasco et al., 2008), it is now feasible to perform comparative studies between highly divergent genomes by screening large EST databases of different plant species against the Arabidopsis genomic sequence (Fulton et al., 2002; Albert et al., 2005).
To evaluate the relationship between apple and Arabidopsis, a computational approach was used as described by Van der Hoeven et al. (2002). General trends in gene conservation and functionality between apple and Arabidopsis were evaluated by screening all apple unigenes in all possible translational frames (tBLASTX) against the complete Arabidopsis genomic sequence (http://www.arabidopsis.org), and resultant E-values of BLAST similarity searches were used as estimates of sequence conservation. As pointed out by Van der Hoeven et al. (2002), two factors could have effects on such an assumption: sequence length and type of analysis performed. Many of the unigene sequences were not full-length sequences, thereby lowering potential E-values. BLAST analysis conducts local alignments, resulting in high E-values over short stretches of sequence conservation, thus favoring conservation of domains but not complete genes. Given our primary intent to use E-values to reveal general trends in the conservation of sequences and their functionality, we think that such drawbacks are unlikely to affect our overall conclusions.
Approximately 75% of apple unigene sequences had significant matches at the amino acid level (E < 1.0 × 10⁻⁵) to one or more translated portions of the Arabidopsis genomic sequence. The highest proportion of apple unigenes (30%) with matches to Arabidopsis genome fell into categories with strong homology to their Arabidopsis counterparts (E < 1.0 × 10⁻⁵⁰ to > 1.0 × 10⁻¹⁰⁰) (Fig. 3A).
A closer examination of the putative functional role and the degree of sequence similarity (to its closest Arabidopsis counterpart) of each apple unigene was performed to further analyze the nature of both fast- and slow-evolving genes identified by apple–Arabidopsis comparisons (Fig. 3B). This would likely provide further insight into types of genes and gene functions that were either more stable across plant taxa or those likely to have evolved more rapidly as species evolved (Van der Hoeven et al., 2002). Of the 24,848 apple unigenes, ∼50% showed high (E < 1.0 × 10⁻⁵⁰), ∼40% intermediate (E ≤ 1.0 × 10⁻¹⁵ to E ≥ 1.0 × 10⁻⁵⁰), and ∼10% low (E > 1.0 × 10⁻¹⁵) levels of conservation with Arabidopsis genes (Fig. 3A).
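The three conservation classes defined by these E-value boundaries can be expressed as a small Python function. This is a minimal sketch of the binning rule only. The function name and the example E-values are hypothetical, and the extra "no detectable homolog" class uses the E > 0.1 boundary that the paper applies when identifying apple-specific unigenes.

```python
from collections import Counter


def conservation_class(evalue: float) -> str:
    """Bin a best-hit E-value into the conservation classes used in the text."""
    if evalue > 0.1:
        return "no detectable homolog"
    if evalue > 1.0e-15:
        return "low"            # fast-evolving
    if evalue >= 1.0e-50:
        return "intermediate"   # intermediate-evolving
    return "high"               # slow-evolving


# Synthetic best-hit E-values for illustration only.
example_evalues = [1e-120, 1e-60, 1e-50, 1e-30, 1e-15, 1e-10, 0.5]
class_counts = Counter(conservation_class(e) for e in example_evalues)
```

Note that both boundary values (1.0 × 10⁻¹⁵ and 1.0 × 10⁻⁵⁰) fall in the intermediate class, matching the inclusive inequalities given in the text.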
Detailed analysis of assigned putative functional roles revealed that within the “slow-evolving” category, the highest proportion of genes were annotated in metabolism and biosynthesis categories, 61 and 60%, respectively (Fig. 3B). The frequency of genes assigned to these two categories decreased as we moved to the “intermediate-evolving” (E ≤ 1.0 × 10⁻¹⁵ to E ≥ 1.0 × 10⁻⁵⁰; 31 and 32%, respectively) and the “fast-evolving” (E > 1.0 × 10⁻¹⁵; 8 and 9%, respectively) categories, thus suggesting that genes involved in metabolism and biosynthesis remained highly conserved during plant evolution between apple and Arabidopsis (Fig. 3B). A similar trend was observed for genes involved in transport, signaling, intracellular, and membrane functions. However, genes encoding transcription and transcription factor activity appeared to be in transition toward faster evolving categories, changing from 30% in the slow-evolving category to 50 and 19% in the intermediate- and fast-evolving categories (Fig. 3B). Interestingly, genes annotated in catalytic activity appeared only in the fast-evolving category, suggesting the furthest divergence between apple and Arabidopsis. Genes involved in binding and organelle functions (e.g., chloroplast and mitochondria) exhibited similar frequencies in slow- and intermediate-evolving categories, with a slight decline in the fast-evolving category (Fig. 3B).
Identification of Apple-Specific Unigenes
Of 33,285 apple unigenes, 8437 (25%) had no detectable homologs (E >0.1) in the Arabidopsis genome (Fig. 3A). This set of unigenes was further searched against the GenBank protein database to identify putative matches. A small proportion of these unigenes (816, 10%; E <1.0 × 10−15) showed similarities to protein sequences in GenBank. Of those showing homology with one or more GenBank entries, a subset of 457 (56%) unigenes with the most significant matches (E <1.0 × 10−30) were annotated for putative gene functions.
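The successive filters described above form a simple cascade, and each reported percentage is relative to the subset produced by the previous step. The short Python sketch below recomputes those percentages from the counts given in the text; the step labels and the helper name are illustrative only.

```python
# Counts reported in the text for each successive filter.
steps = [
    ("all apple unigenes", 33285),
    ("no Arabidopsis homolog (E > 0.1)", 8437),
    ("GenBank protein match (E < 1.0e-15)", 816),
    ("annotated (E < 1.0e-30)", 457),
]


def stepwise_percentages(steps):
    """Percentage of each subset relative to the set it was drawn from."""
    result = []
    for (_, parent), (label, child) in zip(steps, steps[1:]):
        result.append((label, round(100 * child / parent)))
    return result


for label, pct in stepwise_percentages(steps):
    print(f"{label}: {pct}%")
```

The rounded values reproduce the 25, 10, and 56% quoted in the paragraph.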
A large proportion (60%) of these 457 unigene sequences revealed perfect matches with bacterial, viral, fungal, or other nonplant sequences. Only 183 of the original 457 unigenes, all lacking detectable counterparts in the Arabidopsis genome, matched genes from other plant species (those available in the GenBank protein database). Twenty-seven of these unigenes (15%) corresponded to 10 gene families that appeared to be specific to Rosaceae, having matches with other rosaceous plants but not with other plant families (Table 8). Six of these gene families were Malus specific: Mal d 1 (nine unigenes assigned), a major food allergen in apple (Son et al., 1999); polyphenol oxidase (three unigenes assigned), known to be involved in browning of damaged fruits (Haruta et al., 1999) and in herbivory resistance (Murata et al., 1997); a MADS box protein (a single unigene assigned), a developmental transcription factor (Leland and Podila, 2004); a fruit acidity–related protein (Mal-DDNA-DQ417661) (a single unigene assigned) from M. × domestica (Yao et al., 2007); a dehydrin (a single unigene assigned), a member of a class of plant proteins related to drought and cold stress responses (Garcia-Bañuelos et al., 2006; Yao et al., 2007); and the AHAP2 transcription factor (a single unigene assigned).
| Match description, species (GenBank no.) | E-value | Query length (amino acids) | No. of isoforms |
| --- | --- | --- | --- |
| Mal d 1-like, Malus × domestica (AAS00042–AAD00053) | 1.0 × 10−160 | 101–163 | 9 |
| Polyphenol oxidase 2 precursor, Malus × domestica (AAK56323) | 1.0 × 10−194 | 191–587 | 3 |
| Polyphenol oxidase precursor, Prunus armeniaca (AAC28935) | 1.0 × 10−108 | 99–245 | 2 |
| Polyphenol oxidase, Prunus salicina var. cordata (AAW58109) | 1.0 × 10−184 | 385 | 2 |
| Transcription factor AHAP2, Malus × domestica (AAL57045) | 1.0 × 10−70 | 180 | 1 |
| Dehydrin, Malus × domestica | 1.0 × 10−59 | 172 | 1 |
| MADS box protein, Malus × domestica (CAC86183) | 1.0 × 10−41 | 84 | 1 |
| Fruit acidity–related protein, Malus × domestica (Mal-DDNA–DQ417661) | 1.0 × 10−37 | 112 | 1 |
The majority of apple unigenes (90%), with no matches to the Arabidopsis proteome, had matches to species belonging to two clades of angiosperms, eudicots (94%) and monocots (6%) (data not shown).
Comparisons of Apple Unigenes with Those in Citrus and Poplar
Comparison of the apple unigene set with the Arabidopsis gene repertoire provides an overview of gene evolution between these two species since the time of their first divergence from their common ancestor. However, this does not provide insight into the broader context of genes that have differentiated since then and that may hold clue(s) for tree-specific gene evolution. With the advent of genomics efforts in other plant species, especially tree species, such as poplar (Tuskan et al., 2006) and citrus (Forment et al., 2005), both members of the core eudicot clade rosids (Albert et al., 2005), it is now possible to use computational comparisons between poplar and citrus and the apple unigene set to search for genes that may be linked to tree-specific characters.
In an attempt to identify such likely tree-specific genes, the apple unigene set was computationally compared with the citrus unigene data set, available in the NCBI UniGene database, and with the poplar predicted proteome, available at the JGI Populus trichocarpa v1.1 genome site (http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html). The entire apple unigene set was compared with the entire citrus unigene set and the poplar predicted proteome at the amino acid level using tBLASTX (Fig. 4). Almost 40% (13,521) of apple unigenes matched one or more sequences in the citrus UniGene database (Table 7), and ∼20% (2667) of those had a highly significant counterpart in citrus (tBLASTX E <1.0 × 10−50) (Fig. 4A). The proportion of less conserved sequences between apple and citrus (tBLASTX E ≤1.0 × 10−15 to ≥1.0 × 10−50) increased to 53% (7190), while the category of the “fastest evolving” genes comprised 27% (4115) of apple sequences (Fig. 4A). Apple–Arabidopsis–citrus comparisons revealed 189 apple sequences that had a match in the citrus unigene set but not in the Arabidopsis proteome (Fig. 4B). Most of these sequences, 83% (157), belonged to a fast-evolving gene category (tBLASTX E >1.0 × 10−20).
Most of the apple unigenes (98%) that had counterparts in the citrus UniGene database could be functionally annotated using the Arabidopsis genomic sequence (http://arabidopsis.org), and they were evenly distributed among the three GO categories.
Out of 33,285 apple unigenes, 77% (25,817; tBLASTX E <1.0 × 10−5) had counterparts in the predicted proteome of Populus trichocarpa (Table 7), and ∼50% of those (13,091) had highly significant counterparts in poplar (E <1.0 × 10−50) (Fig. 4A). The proportion of intermediate-evolving sequences between apple and poplar decreased to 37%, while the fast-evolving genes encompassed only 13% of apple sequences (Fig. 4A). Functional annotation of apple–poplar matches assigned putative functional roles to 60% (15,411) of apple–poplar homologs. However, 40% (10,406) of apple sequences with matches in the poplar JGI protein database could not be annotated through GO (Fig. 4B). Most of these sequences, 77% (8033), were classified in the slow-evolving gene category (tBLASTX E <1.0 × 10−20). Apple sequences with counterparts in poplar but not in the Arabidopsis protein database were further compared with the nr protein database available at NCBI; this revealed that 3428 (13%) apple–poplar matches had no similarity to sequences in any other available protein or nucleotide database. These apple sequences were the most likely to contain tree-specific genes because the majority (60%) exhibited high conservation (E <1.0 × 10−20) with poplar proteins (Fig. 4B). In addition, 21% of the proposed tentative tree-specific genes had matches of at least 100 amino acids with poplar counterparts and ranged from 900 to 2796 bp in length.
To uncover patterns of conservation and divergence among apple, citrus, and poplar, we computationally compared the apple unigene set with those of both citrus and poplar, and used E-values of putative functional annotations based on the Arabidopsis–apple comparison (Fig. 5). Detailed analyses of assigned putative functional roles for apple–citrus and apple–poplar matches revealed that, for almost all categories, at least 50% of apple–citrus homologs belonged to the intermediate-evolving category, and apple–poplar homologs to the slow-evolving category, suggesting higher conservation between apple and poplar than between apple and citrus (Fig. 5). For example, the proportion of genes involved in metabolism and catalytic activity between apple and poplar was highest in the slow-evolving category, 64 and 62%, respectively, and decreased when moving to the intermediate-evolving (E ≤1.0 × 10−15 to E ≥1.0 × 10−50, 28 and 29% of unigenes, respectively) and the fast-evolving (E > 1.0 × 10−15; 8 and 10% of unigenes, respectively) categories, suggesting that metabolism and catalytic activity remained highly conserved in plant evolution between apple and poplar (Fig. 5). However, genes encoding biosynthesis activity appeared to be transitioning to faster evolving groups between apple and poplar, changing from 19% in the slow-evolving category to 62 and 19% in the intermediate- and fast-evolving categories, respectively (Fig. 5). As for the apple–citrus evolutionary divergence, most genes that are homologous between those two species belong to the intermediate-evolving category, suggesting greater divergence between apple and citrus than between apple and poplar (Fig. 5).
Comparisons of the apple–citrus and apple–poplar matches revealed 40% of apple unigenes with matches in both databases. Approximately 99% of apple–citrus homologs also have counterparts in the poplar proteome, while 48% (12,402 apple unigenes) of apple–poplar homologs do not have counterparts in citrus.
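The overlap figures above are set comparisons between the apple unigenes that matched citrus and those that matched poplar. A minimal Python sketch of such a comparison is given below; the unigene identifiers are hypothetical placeholders and the function name is illustrative, not part of the analysis pipeline used here.

```python
def overlap_summary(citrus_hits, poplar_hits):
    """Summarise how two sets of matched apple unigene IDs overlap."""
    citrus_hits, poplar_hits = set(citrus_hits), set(poplar_hits)
    shared = citrus_hits & poplar_hits
    poplar_only = poplar_hits - citrus_hits
    return {
        "shared": len(shared),
        "citrus_only": len(citrus_hits - poplar_hits),
        "poplar_only": len(poplar_only),
        # Fraction of citrus-matched unigenes that also match poplar.
        "citrus_also_in_poplar": (
            len(shared) / len(citrus_hits) if citrus_hits else 0.0
        ),
        # Fraction of poplar-matched unigenes with no citrus match.
        "poplar_without_citrus": (
            len(poplar_only) / len(poplar_hits) if poplar_hits else 0.0
        ),
    }


# Hypothetical unigene identifiers, for illustration only.
citrus = {"apple_0001", "apple_0002", "apple_0003"}
poplar = {"apple_0002", "apple_0003", "apple_0004", "apple_0005"}
print(overlap_summary(citrus, poplar))
```

With the counts reported in the text, 25,817 − 12,402 = 13,415 unigenes are shared between the two comparisons, which is about 40% of the 33,285 apple unigenes and about 99% of the 13,521 apple–citrus matches.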
Large-scale single-pass sequencing of cDNA clones randomly picked from libraries is a very powerful approach for gene discovery and for providing a global profile of the transcriptional activity within tissues (Adams et al., 1993). However, both reliability of data and frequency of identifying novel sequences depend to a large extent on the quality of the constructed cDNA libraries (Bonaldo et al., 1996; Gasic et al., 2005). To increase the discovery of rare transcripts and to reduce the time involved in constructing cDNA libraries, we have combined two different approaches for cDNA library construction. The first approach involved pooling equimolar amounts of cDNAs from different developmental stages of the same apple tissues, and normalization. Using this strategy, we have developed four cDNA libraries from flower, fruit, bud, and stem tissues, comprising 14 developmental stages, from the apple cultivar GoldRush, and generated a total of 63,384 EST sequences. Each cDNA has been tagged with a different 6-nucleotide tag at the 3′ end, thus allowing us to identify the developmental stage of the target tissue. Additionally, 29 nonnormalized or primary libraries have been constructed from six different tissues (bud, shoot, leaf, flower, fruit, and root) and two treatments (biotic and abiotic stresses) from nine apple cultivars (Braeburn, Fuji, GoldRush, Granny Smith, Jonagold, Red Delicious, Royal Gala, and Suncrisp); three apple rootstocks (M.9, M.111, and Geneva 3041); and a single interspecific hybrid (M. × domestica cv. Geneva 3041 × M. sieversii), generating a total of 118,857 EST sequences. The majority of the ESTs (67%) originated from tissues collected from the late-ripening, yellow-fruited apple cv. GoldRush, which has excellent fruit quality and long storability combined with field immunity to apple scab disease, a high level of resistance to apple powdery mildew [Podosphaera leucotricha (Ellis and Everh.)], and moderate resistance to the bacterial disease fire blight (Crosby et al., 1994).
Clustering of high-quality sequences reduced the number of ESTs to 33,285 apple unique sequences, comprising 23,442 tentative consensus sequences and 9843 singletons. Analysis of the overall contribution of the libraries to the data set showed that no single library contained >7% of the total number of singletons, suggesting that most of the diversity was derived by sequencing different sources of tissues, which was similar to recent findings by Newcomb et al. (2006). The reproductive tissues, consisting of 13 libraries constructed from six different genotypes, comprised 28% of all apple ESTs and provided the highest contribution (40%) to the apple unigene set. However, a high redundancy was observed in the normalized fruit library, and this was attributed to tissue sampling. The normalized fruit library comprised all six stages of apple fruit development, while the nonnormalized fruit library was derived from the first developmental stage, young fruitlets (9 d after pollination). Additionally, during the last three stages of fruit tissue collection, maturity stages I and II as well as ripe fruit, the amount of RNA in these tissues was very low. This suggested that during this period of fruit development, gene expression was low as cell–tissue differentiation has been completed by then, while ongoing cell expansion was accompanied by sugar and starch accumulation.
Taking into account the number of apple sequences used for assembly, the total number of apple unigenes obtained in this study, 33,285, is comparable to recently published estimates by Newcomb et al. (2006) and Park et al. (2006). It is likely that this is an overestimate of the actual number of genes present in the apple genome. Using the analogy of Arabidopsis (Arabidopsis Genome Initiative, 2000), Newcomb et al. (2006) have estimated the total number of apple genes to be ∼27,000. However, a more accurate estimate of the total number of genes in the apple genome can be made by comparing the size of the EST-derived unigene set and the percentage of predicted genes in genomic DNA (e.g., BAC sequences) that are represented by a unigene match. Recently, a first draft of the physical map of the apple genome has been constructed in our laboratory (Han et al., 2007), and efforts are underway to anchor this physical map to the genetic map. In addition, sequencing of the whole apple genome is also underway (Shulaev et al., 2008; Velasco et al., 2008). Therefore, these new genomic resources should eventually provide a more accurate accounting of the number of genes present in the apple genome.
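The estimate outlined above can be illustrated with a short calculation. The Python sketch below is one possible reading of that approach, not the method used in this study: it scales the unigene count by the fraction of predicted genes in genomic sequence that have a unigene match, and it adds a redundancy correction as an extra assumption, since several unigenes may derive from nonoverlapping parts of one gene. The function name, the redundancy parameter, and all input values are hypothetical.

```python
def estimate_gene_number(unigene_count, matched_genes, predicted_genes,
                         redundancy=0.0):
    """Rough gene-number estimate from EST coverage of predicted genes.

    unigene_count   -- size of the EST-derived unigene set
    matched_genes   -- predicted genes (e.g., on BACs) with a unigene match
    predicted_genes -- all predicted genes in the same genomic sequence
    redundancy      -- assumed fraction of unigenes that duplicate a gene
                       already represented (hypothetical correction)
    """
    if predicted_genes <= 0 or matched_genes <= 0:
        raise ValueError("gene counts must be positive")
    if matched_genes > predicted_genes:
        raise ValueError("matched genes cannot exceed predicted genes")
    if not 0.0 <= redundancy < 1.0:
        raise ValueError("redundancy must be in [0, 1)")
    coverage = matched_genes / predicted_genes
    distinct_genes = unigene_count * (1.0 - redundancy)
    return distinct_genes / coverage


# Placeholder inputs chosen only to show the arithmetic (not real data).
print(estimate_gene_number(1000, matched_genes=50, predicted_genes=100,
                           redundancy=0.5))
```

Without a redundancy correction the estimate can never fall below the unigene count itself, which is why some such adjustment is needed if the unigene set overestimates the true gene number.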
Computational comparisons of apple unigenes against the Arabidopsis proteome and the nr protein database have allowed for identification of putative homologous protein sequences and assignment of putative functional roles to 75 and 79% of our transcripts, respectively. These proportions are similar to those found in grape (Moser et al., 2005) and in other woody perennial plant species, such as peach (Horn et al., 2005), citrus (Bausher et al., 2003; Forment et al., 2005), and poplar (Sterky et al., 2004). The remaining 21% of our sequences, having no matches to any sequences in public databases, may represent apple-specific genes. Using a similar approach, Van der Hoeven et al. (2002) have been able to assign putative functions to only 30% of tomato unigenes, while Newcomb et al. (2006) have reported that only 6% of apple nonredundant sequences do not have matches in Arabidopsis. These observed differences may be attributed to differences in E-value thresholds as well as in the depth of EST sampling: an E-value of <1.0 × 10−10 has been used in tomato compared to an E-value of <1.0 × 10−5 in apple; moreover, ∼150,000 apple ESTs have been sampled by Newcomb et al. (2006) compared to ∼190,000 apple ESTs used in this study.
The GO classification of apple–Arabidopsis matches showed a similar distribution of apple unigenes among the three categories: molecular function, biological process, and cellular component. In addition, representatives have been found in every major putative functional role, thus indicating that a genome-wide EST collection has been generated. Furthermore, the distribution of functionally annotated apple unigenes resembles that of the full set of proteins in Arabidopsis. These findings are similar to those reported for citrus–Arabidopsis comparisons (Forment et al., 2005). Methods of predictive bioinformatics, such as comparison with MIPS-based role classification of Arabidopsis and matches to InterPro protein families, have also been employed to elucidate the function of encoded proteins predicted from apple unigenes. The most frequently represented classes of genes were the protein kinases, followed by leucine-rich repeat (LRR) and RNA recognition motif proteins. In general, our findings are in agreement with the previously reported distribution of ∼43,000 apple nonredundant sequences among InterPro protein families by Newcomb et al. (2006). However, small discrepancies attributed to differences in genotype–tissue–treatment sources for cDNA development were noted. For example, high numbers of sequences in InterPro classes that were potentially involved in disease resistance were detected in both studies; for the LRR class of proteins (IPR001611), 321 were found by Newcomb et al. (2006), and 380 were found in this study. However, we detected almost twice as many sequences in the protein kinase (IPR000719) class as Newcomb et al. (2006) (1040 vs. 564), and failed to detect apple unigene sequences in either the NBS (nucleotide binding site)–LRR or plant-specific LRR protein classes. In addition, other functional classes of proteins, such as putative transcription factors, were identified in both databases.
Comparisons of the frequency of the most common transcription factor families in both data sets and with Arabidopsis (Riechmann et al., 2000) and rice (Goff et al., 2002) revealed similar rankings.
Comparisons of the apple unigene set with those of other plant species, available in the NCBI UniGene database, have shown various levels of similarity (Table 7). As expected, the highest level of similarity is observed with the poplar and Arabidopsis proteomes as well as with the rice protein database. The observed high level of sequence similarity between apple and each of poplar and Arabidopsis is in agreement with their positions on the ToL, with apple and poplar belonging to the eurosid I clade and Arabidopsis to the eurosid II clade. On the other hand, the observed high sequence similarity between apple and rice does not agree with their placement on the ToL; this can be attributed to the amount of available sequence data and to genes involved in basic metabolic pathways that have remained conserved among plant species. Significant levels of similarity are also observed with several other plant species having different phylogenetic relationships with apple, including soybean, citrus, grape, and tomato; although all are eudicots, they belong to different families, namely Fabaceae, Rutaceae, Vitaceae, and Solanaceae, respectively (Fulton et al., 2002; Albert et al., 2005). Phylogenetic trees of species of the Floral Genome Project (Albert et al., 2005) indicate that there is a relatively close evolutionary relationship between apple and Arabidopsis. Thus, an evaluation of the general trends in gene conservation and functionality between apple and Arabidopsis has been initiated to reveal trends in both gene and genome divergences between these two species. Significant matches to Arabidopsis genes, likely exhibiting conserved gene functions, have been identified for 30% of apple unigenes. Similar to findings in tomato (Van der Hoeven et al., 2002), the majority of apple unigenes (80%) with no matches in Arabidopsis have unknown functions and are without matches in other genome databases.
Hence, these may represent fast-evolving genes that have acquired new functions in apple and related taxa. The majority of these novel genes, such as Mal d 1, are confined to apple and to other rosaceous species. Additionally, an assessment of the apple gene content provides evidence for selective gene loss in the Arabidopsis lineage, which has been previously identified from the tomato–Arabidopsis comparison (Van der Hoeven et al., 2002). Examples of such selective losses include polyphenol oxidases that are present in apple, tomato, and many other plant species, but not in Arabidopsis. If the Arabidopsis and tomato lineages diverged ∼100 to 150 million years ago during the evolution of flowering plants (Yang et al., 1999), we can speculate that the apple and Arabidopsis lineages must have diverged from their common ancestor at a later date, ∼75 to 100 million years ago (Albert et al., 2005). This suggests that the loss of the polyphenol oxidase gene function in Arabidopsis must have occurred sometime following its divergence from apple.
Comparison of the apple unigene set with the Arabidopsis gene repertoire provides an overview of gene evolution between those two species since their early divergence from a common ancestor. However, this does not provide any insight into genes that have differentiated since then, particularly those that are lost from the Arabidopsis lineage and are likely to hold clues to tree-specific gene evolution. The majority of apple–citrus (13,352) and apple–poplar (25,817) matches also exhibit similarities with the Arabidopsis gene repertoire, 98 and 73%, respectively. Moreover, the majority of apple sequences present only in citrus (53%) but not in Arabidopsis belong to the fast-evolving gene category, and range in length from 168 to 2058 bp. Conversely, 71% of apple sequences that have counterparts in the poplar predicted proteome but not in Arabidopsis belong to the slow-evolving category (E <1.0 × 10−20). Further search for functional annotation of apple–poplar matches against the nr protein database has revealed ∼3000 apple sequences having counterparts only in poplar but not in any other available protein or nucleotide database. These apple unigenes may represent tree-specific genes, and their functional roles should be explored. Phylogenetically, apple and poplar belong to eurosid I, while citrus belongs to eurosid II, both clades of rosids. The rosids represent the largest of eight major clades of core eudicots, and include nearly one-third of all flowering plants. Single- and multigene phylogenies of rosids have identified seven major clades, and although relationships among these clades remain unresolved, DNA-based studies support the monophyly of the rosids (Savolainen et al., 2000a, 2000b; Soltis et al., 2000, 2003; Hilu et al., 2003; Ravi et al., 2007). Thus, our “tree-specific gene set” assumption that the state of the “tree-form” is monophyletic within the apple, poplar, and citrus grouping is valid.
Genes involved in basic metabolic pathways appear to be largely conserved among apple, citrus, poplar, and Arabidopsis. This finding is consistent with those for Arabidopsis, tomato, and Medicago truncatula Gaertn. (Van der Hoeven et al., 2002), and further supports the hypothesis that basic metabolic pathways remain conserved among plant species. However, genes encoding transcription factors among apple, citrus, poplar, and Arabidopsis are largely present in less conserved categories, and their frequencies seem to double when moving from slow-evolving to fast-evolving categories (Fig. 3 and 5). These appear to diverge more rapidly among plant species, thus suggesting that changes in gene regulation present a significant force in plant evolution (Doebley and Lukens, 1998; Stern, 2000). The observed evolutionary divergences among apple, poplar, citrus, and Arabidopsis correspond to their phylogenetic relationships (Albert et al., 2005), with higher conservation observed between apple and poplar than between apple and either citrus or Arabidopsis.
In summary, we present an extensive set of ESTs derived from various genotypes, tissues, and treatments, which contributes to the overall value of publicly available apple sequences. This data set has been used as a rich source for SSR and SNP marker development, for comparative genomics studies, and for creating an apple microarray useful for functional genomics studies and for characterizing genes involved in various biological processes.