Assembly and analysis of the complete Salix purpurea L. (Salicaceae) mitochondrial genome sequence

Plant mitochondrial (mt) genomes possess several complex features, including variable size, a dynamic genome structure, and complicated patterns of gene loss and gain throughout evolutionary history. Studies of plant mt genomes can therefore provide unique insights into organelle evolution. We assembled the complete Salix purpurea L. mt genome by screening genomic sequence reads generated on a Roche-454 pyrosequencing platform. The pseudo-molecule obtained has a typical circular structure 598,970 bp long, with an overall A + T content of 55.06% (GC content 44.94%). The S. purpurea mt genome contains 52 genes: 31 protein-coding genes, 18 tRNAs, and three rRNAs. Eighteen tandem repeats and 404 microsatellites are distributed unevenly throughout the S. purpurea mt genome. A phylogenetic tree of 23 representative terrestrial plants strongly supports the inclusion of S. purpurea in the Malpighiales clade. Our analysis contributes toward understanding the organization and evolution of organelle genomes in Salicaceae species.
Electronic supplementary material: The online version of this article (doi:10.1186/s40064-016-3521-6) contains supplementary material, which is available to authorized users.


Background
Mitochondria contribute to energy metabolism and play fundamental roles in plant development, fitness, and reproduction; they are also associated with the biosynthesis of fatty acids and several active proteins (Mcbride et al. 2006; Ryan and Hoogenraad 2007). The mitochondrial (mt) genome has drawn increased attention during the genomic and now post-genomic eras owing to its maternal pattern of inheritance and unique evolutionary features, and is often used for the phylogenetic study of plants (Gualberto et al. 2014; Dames et al. 2015). Plant mt genomes can be far larger than animal mt genomes and vary significantly in size, even between very closely related species or within a single family (Alverson et al. 2010), whereas animal mt genomes are conserved and relatively uniform in size (Liu et al. 2013). Since the first angiosperm mt genome nucleotide sequence was determined in 1997 (Arabidopsis thaliana; NC_001284) (Unseld et al. 1997), more than 100 complete land plant mt genome sequences have become available through the NCBI Organelle Genome Resources Web site (http://www.ncbi.nlm.nih.gov/genome/organelle/), ranging in size from 100,725 bp (Buxbaumia aphylla; GenBank accession number NC_024518) (Liu et al. 2014) to 1555.93 kb (Cucumis sativus; GenBank accession number NC_016005) (Alverson et al. 2011b). The comparative analysis of plant mt genomes enhances our understanding of genome rearrangement and DNA transfer mechanisms, and of phylogenetic diversity.
Salix purpurea L. is a willow species native to much of Europe (north to the British Isles, Poland, and the Baltic States), western Asia, and North Africa (Argus 1997; Skvortsov 1999; Sulima et al. 2009). It is a deciduous shrub growing 1-3 m tall, with purple-brown to yellow-brown shoots, green foliage, and small purple or red catkins produced in early spring. S. purpurea has frequently been cultivated for its commercially important biomass. Purple willow bark contains a particularly valuable raw material traditionally used for the production of natural aspirin and other salicylic glycosides with analgesic, antipyretic, and anti-inflammatory effects (Skrzypczyńska 2001; Hakmaoui et al. 2007; Aliferis et al. 2015).
*Correspondence: yening@njfu.edu.cn. The Southern Modern Forestry Collaborative Innovation Center, Nanjing Forestry University, Nanjing 210037, Jiangsu, China.
With the development of next generation sequencing (NGS) technologies, such as the Roche and Illumina platforms, new strategies are being used to characterize plant mitochondrial genomes. The mt genomes of carrot, soybean (Chang et al. 2013), rubber tree (Shearman et al. 2014), and some other species (Liu et al. 2013; Rd et al. 2015) have been successfully assembled through a combined approach using shotgun and paired-end NGS sequencing of non-enriched whole genome DNA libraries. The S. purpurea chloroplast genome has been published (Carlson et al. 2015), providing a resource for genetic improvement and for understanding biological mechanisms in plant species; however, the complete S. purpurea mt genome has not previously been published, because of its complex structure. In this study, we present the first complete mt genome of S. purpurea. We generated the mt genome sequence from whole genome 454 pyrosequencing data. The mt genome was assembled and annotated as a circular-mapping DNA molecule. Additionally, we compared the S. purpurea mt genome with several previously published genomes to better understand the evolution of organellar genomes. The strategy used in this study is broadly applicable to exploring additional mitochondrial genomes and to investigating intra-cellular genome interactions and genome evolution.

Plant material
The raw sequencing and alignment data from the S. purpurea genome project are available at the NCBI Genome Resources Sequence Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=116760). The raw data were generated using Roche-454 FLX Titanium sequencing from random whole genome shotgun libraries. We deposited three whole genome sequence biosamples (Accessions: SRX029331, SRX029332, SRX029333), which have 1,270,964, 549,435, and 448,379 spots, with total lengths of 1.4 Gb, 658.4 Mb, and 539 Mb, respectively.

Genome assembly
Our research goal was to produce a gap-free, scaffold-level S. purpurea mt genome. Two random genomic 454 sequencing read samples were combined for assembly using the gsAssembler Java GUI in Newbler (version 2.7) with default parameters, producing 50, 115, 25, 100, and 17,094 assembled contigs from five separate runs. Because the initial contigs are a mixture of nuclear and organellar DNA, BLASTN (Buhler et al. 2007) was used to isolate mitochondrial contigs from the whole genome reads, based on plant mt genome sequences downloaded from the NCBI Organelle Genome Resources. A total of 5831 contigs, with read depths between 50× and 100×, contained essential mitochondrial genes. We used Perl scripts to visualize contig connections from the Newbler assembly results, which record the read depth and connection information for every contig. False links to other contigs and a few wrong forks were removed manually, according to the read depth of the contigs. We connected 26 final contigs to produce a circular mt genome consistent with the standard structure of most mitochondrial genomes, and we mapped the sequence to the Populus tremula mt genome (NC_028096). The complete S. purpurea mt genome sequence is 598,970 bp long.
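The selection step described above (keep contigs that match known plant mt sequences and whose read depth falls in the 50× to 100× range) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Perl pipeline: the `Contig` record, its field names, and the function `candidate_mt_contigs` are hypothetical, and the BLASTN result is reduced to a boolean flag.

```python
from dataclasses import dataclass

@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    depth: float     # mean read depth reported by the assembler
    mt_hit: bool     # True if BLASTN matched a reference plant mt genome

def candidate_mt_contigs(contigs, min_depth=50.0, max_depth=100.0):
    """Keep contigs with a mitochondrial BLASTN hit and depth in [min_depth, max_depth]."""
    return [c for c in contigs if c.mt_hit and min_depth <= c.depth <= max_depth]

# Toy data (invented for illustration only).
contigs = [
    Contig("c1", 12000, 72.0, True),    # plausible mt contig
    Contig("c2", 8000, 6.5, True),      # mt-like but nuclear-level depth
    Contig("c3", 15000, 850.0, True),   # depth far above the mt range
    Contig("c4", 9000, 65.0, False),    # right depth, no mt hit
]
selected = candidate_mt_contigs(contigs)
```

Depth is used as a second filter because nuclear copies of mitochondrial DNA can produce BLASTN hits; in the procedure described above, read depth is likewise what guided the manual removal of false links.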

Genome annotation
The S. purpurea mt genome was preliminarily annotated using the online program DOGMA (Dual Organellar GenoMe Annotator) (Wyman et al. 2004), coupled with manual corrections of gene start and stop codons by comparison with homologous genes from other sequenced mt genomes. Subsequently, a detailed annotation of the protein-coding, rRNA, and tRNA genes was performed with a local database containing the nucleotide and protein sequences of all published land plant mitochondrial genomes available through the NCBI Organelle Genome Resources site. We also used tRNAscan-SE (Schattner et al. 2005) with default settings to corroborate tRNA boundaries identified by BLASTN. The circular mt genome map was drawn using the OrganellarGenomeDRAW tool (OGDRAW) (Lohse et al. 2007) for further comparison of gene order and content.

Repeat structure
Tandem repeats in the S. purpurea mt genome were identified using the Tandem Repeats Finder program (Benson 1999) with default settings. The Perl script MISA (Thiel et al. 2003) was used to detect simple sequence repeats (SSRs) with motif sizes of one to six nucleotides and minimum repeat numbers of eight, four, four, three, three, and three, respectively. All repeats identified by the various programs were manually checked to remove redundant results.
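To make the SSR thresholds concrete, the following sketch scans a sequence for perfect repeats using the motif sizes and minimum repeat numbers given above. It is a simplified stand-in written for illustration, not MISA itself: the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical, the scan is greedy from left to right, and compound or interrupted SSRs are not handled.

```python
# Minimum repeat numbers per motif size (1-6 nt), as stated in the text.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a list of (start, motif, repeat_count) with 0-based start positions."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        for size in sorted(min_repeats):
            motif = seq[i:i + size]
            # Skip truncated motifs, non-ACGT characters, and non-primitive motifs.
            if len(motif) < size or set(motif) - set("ACGT") or not is_primitive(motif):
                continue
            count = 1
            while seq[i + count * size:i + (count + 1) * size] == motif:
                count += 1
            if count >= min_repeats[size]:
                hits.append((i, motif, count))
                i += count * size - 1  # jump to the last base of this repeat
                break
        i += 1
    return hits

# Toy sequence: a 10-nt poly-A run, five AT units, and three ACG units (below threshold).
toy = "GG" + "A" * 10 + "CC" + "AT" * 5 + "GC" + "ACG" * 3
ssrs = find_ssrs(toy)
```

In the toy sequence the ACG run is not reported because three units fall short of the trinucleotide minimum of four, whereas three units would suffice for a tetra-, penta-, or hexanucleotide motif.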

Phylogenetic analysis
Phylogenetic analysis was performed with the mt genomes of 23 plant species: our newly sequenced S. purpurea mt genome and those of 22 other plant species. Twenty protein-coding genes (atp1, atp4, atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, nad9, rps3, and rps4), plus three cytochrome c biogenesis genes (ccmB, ccmFc, and ccmFn), were extracted from the mt genomes of the 23 representative species to estimate a phylogenetic tree. Exons of these genes were extracted and sequentially joined together using local Perl scripts. The orthologous genes were aligned using ClustalW (Thompson et al. 1994) and manually adjusted. A phylogenetic tree was estimated using the neighbor-joining algorithm in MEGA version 6.0 (Tamura et al. 2013), with branch support based on 1000 bootstrap replicates.
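The joining of per-gene sequences into a single data set can be illustrated with a short sketch. It assumes that each gene has already been aligned separately and that the aligned genes are then concatenated per species in a fixed gene order, with gaps for any species lacking a gene; this ordering of steps, the function name `concatenate_alignments`, and the toy data are assumptions made for illustration and do not reproduce the authors' Perl scripts.

```python
def concatenate_alignments(alignments, gene_order):
    """Join per-gene alignments into one sequence per species.

    alignments: {gene: {species: aligned_sequence}}
    gene_order: genes in the order they should be concatenated
    Species missing from a gene alignment are padded with '-' for that gene.
    """
    species = sorted({sp for gene in gene_order for sp in alignments[gene]})
    parts = {sp: [] for sp in species}
    for gene in gene_order:
        aln = alignments[gene]
        widths = {len(s) for s in aln.values()}
        if len(widths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        width = widths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

# Toy alignments (invented): two genes, one of them missing in one species.
toy_alignments = {
    "atp1": {"Salix": "ATG-A", "Populus": "ATGCA"},
    "cob": {"Salix": "GGT"},
}
supermatrix = concatenate_alignments(toy_alignments, ["atp1", "cob"])
```

The resulting concatenated alignment is the kind of input that a distance-based method such as neighbor joining would then operate on.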

Genome features of the S. purpurea mitochondrial genome
We assembled the complete S. purpurea mt genome into a single circle of total length 598,970 bp from the S. purpurea whole genome project data, generated using Roche-454 sequencing technology. The sequence has been deposited in the NCBI GenBank Reference Sequence database with accession number NC_029693. We also deposited our S. purpurea mt genome data at GBROWSE (http://bio.njfu.edu.cn/gb2/gbrowse/Salix_pu_mt/). The overall A + T content is 55.06% (GC content 44.94%), with a base composition of 27.24% A, 27.82% T, 22.50% C, and 22.44% G (Table 1).
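Base composition figures of this kind can be recomputed directly from a sequence. The sketch below is a minimal illustration rather than the authors' script, and the function name `base_composition` is hypothetical. It also checks the arithmetic on the four reported percentages: A + T sums to 55.06% and G + C to 44.94%.

```python
from collections import Counter

def base_composition(seq):
    """Return percentages of A, C, G, T plus combined GC and AT content.

    Characters other than A/C/G/T (for example N) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    pct = {base: 100.0 * counts[base] / total for base in "ACGT"}
    pct["GC"] = pct["G"] + pct["C"]
    pct["AT"] = pct["A"] + pct["T"]
    return pct

# Toy sequence, invented for illustration.
toy = base_composition("AATTGCAT")

# Percentages reported in the text for the S. purpurea mt genome.
reported = {"A": 27.24, "T": 27.82, "C": 22.50, "G": 22.44}
reported_at = round(reported["A"] + reported["T"], 2)
reported_gc = round(reported["C"] + reported["G"], 2)
```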

Analysis of tandem repeats and SSRs
Tandem repeats (TRs) are DNA sequence motifs that play an important role in genome recombination and rearrangement (Cavalier-Smith 2002; Zhao et al. 2013), and are often used for population and phylogenetic analyses (Nie et al. 2012; Schaper and Anisimova 2015). We found 18 tandem repeats in the S. purpurea mt genome with lengths ranging from 4 to 28 bp (Table 4). Most of the repeats (94%) were located in non-coding regions (83% in intergenic spacer regions and 11% in introns); the remaining 6% were in protein-coding regions. Simple sequence repeats (SSRs), also known as microsatellites, are short tandem repeat sequences with repeat units generally between one and six base pairs long, and are extensively distributed throughout mitochondrial genomes (Provan et al. 2001; Chen et al. 2006). SSRs are important genetic molecular markers, widely used in assisted breeding (Rafalski and Tingey 1993), population genetics (Doorduin et al. 2011; He et al. 2012; Powell et al. 1995), plant typing (Xue et al. 2012; Yang et al. 2011), and genetic linkage map construction (Pugh et al. 2004). We identified 404 SSR motifs in the S. purpurea mt genome with the microsatellite identification tool MISA (Thiel et al. 2003), accounting for 3810 bp of the total sequence. Among these SSRs, 171 have mononucleotide, 157 have dinucleotide, 17 have trinucleotide, 49 have tetranucleotide, nine have pentanucleotide, and one has hexanucleotide repeat motifs (Fig. 2a). Most of the mononucleotide repeats (90.7%) are composed of A/T, the 23 dinucleotides are all composed entirely of AT/TA, and the rest of the SSRs also have a high A/T content (Additional file 1: Table S1). These results are consistent with observations that mitochondrial SSRs are generally composed of short polyadenine (polyA) or polythymidine (polyT) repeats (Kuang et al. 2011). The high A/T content of mitochondrial SSRs contributes to a biased composition, such that the overall AT content of the S. purpurea mt genome is 55.06%.
SSRs are most abundant in intergenic spacers, which account for 90.35% of all SSRs detected. The remaining 6.44, 2.48, and 0.74% of SSRs are in introns, protein-coding regions, and rRNA regions, respectively (Fig. 2b).

Comparison with other mitochondrial genomes
Multiple complete mt genomes provide an opportunity to compare variation in size, structure, and sequence content at the genomic level (Alverson et al. 2011a). We selected 35 land plant mt genomes and compared their features with those of the S. purpurea mt genome (Additional file 1: Table S2). The mt genome size of our samples ranges from 104,239 bp in Anomodon rugelii to 982,833 bp in Cucurbita pepo, and the GC content ranges from 39.93% in Bucklandiella orthotrichacea to 53.02% in Welwitschia mirabilis. Because of a large number of open reading frames (ORFs) coding for proteins of unknown function in plant mt genomes, and frequent plastid DNA insertions including mitochondrial tRNA genes (Notsu et al. 2002; Marechal-Drouard et al. 1990) [...]
We compared the S. purpurea mt genome in particular with the Populus tremula mt genome (NC_028096), another member of the Salicaceae family. The P. tremula mt genome is 783,442 bp long, much larger than that of S. purpurea; however, its base composition of 27.62% A, 22.36% C, 22.38% G, and 27.64% T, with a slight A + T bias of 55.25%, is similar to that of the S. purpurea mt genome. As described previously, the complete P. tremula mt genome has three rRNA genes, 22 tRNA genes, and 33 protein-coding genes (PCGs). A comparison of all orthologous genes between the two genomes shows that three PCGs (rpl10, rps1, and rps14) and three tRNA genes (trnH-AUG, trnK-CUU, and trnS-UGU) are present in the P. tremula genome but not in the S. purpurea genome, while only two tRNA genes (trnC-ACA and trnV-GAC) are present in the S. purpurea genome but not in the P. tremula genome. The P. tremula genome has 838 bp of tandem repeats, while S. purpurea has 665 bp (Table 5). The S. purpurea mitogenome, with its smaller gene count, sparser PCG annotation, and fewer tandem repeats compared with P. tremula, may provide insight into the divergent evolution of willow and poplar.

Phylogenetic analysis
The dramatic increase in the number of sequenced mt genomes provided by NGS technology can yield unique insights into the phylogenetic relationships among plants.

Conclusions
The mitochondrial genome is proving to be an effective and important tool for gaining insight into species evolution. Plant mt genomes have striking differences in structure, size, gene order, and gene content. This has generated significant interest in exploring and further understanding plant mitochondrial evolution. Our investigation of the complete S. purpurea mt genome is an important addition to the limited amount of genomic data available for the Salicaceae. The S. purpurea mt genome possesses most of the common characteristics of higher plant mt genomes. Our comparative and phylogenetic analyses should contribute to a more comprehensive understanding of mitochondrial molecular evolution in higher plants.