The development of rapid and efficient genome sequencing methods has enabled us to study the evolutionary background of bacterial genetic information. Here, we present a comparative genomic analysis of 17 Streptomyces species for which the genome has been completely sequenced, using the pan-genome approach. The analysis revealed that 34,592 ortholog clusters constituted the pan-genome of these Streptomyces species, including 2,018 in the core genome, 11,743 in the dispensable genome, and 20,831 in the unique genome. The core genome converged to a smaller number of genes than the 3,096 gene families reported previously. Functional enrichment analysis showed that genes involved in transcription were the most abundant in the Streptomyces pan-genome. Finally, we investigated core genes for the sigma factors, the mycothiol biosynthesis pathway, and secondary metabolism pathways; our data showed that many genes involved in stress response and morphological differentiation were commonly encoded in Streptomyces species. Elucidation of the core genome offers a basis for understanding the functional evolution of Streptomyces species and provides insights into target selection for the construction of industrial strains.
Streptomycetes are active producers of a wide range of secondary metabolites, including more than two-thirds of the natural antibiotics used in the pharmaceutical industry. They are members of the largest genus of actinobacteria, are ubiquitous in soil, and undergo complex differentiation from filamentous mycelia to aerial hyphae and spores
. For the genome-scale elucidation of the genetic background of secondary metabolites and the rich repertoire of novel enzymes in this genus, extensive sequence analyses have been carried out for different model
strains, such as
. In addition to a high G+C content and a linear chromosome as important genomic features, the Streptomyces genome encodes a number of sigma factors and transcription factors that are involved in the complex transcriptional regulatory network. Many genes are involved in morphological differentiation, and tens of gene clusters in each strain participate in the biosynthesis of secondary metabolites.
To date, the genome sequences of over 30,000 bacterial species have been reported in the NCBI genome database (
). From this abundance of information, comparative genomics analyses between multiple genomes of individual species have been used to reveal extensive genomic inter- and intraspecies diversity
. Among currently available comparative analysis methods, pan-genome analysis has been used to describe the entire gene repertoire of a bacterial species by identifying the sum of the core and dispensable genomes. Thus, a pan-genome is defined as the full set of non-orthologous genes present in a species and is composed of the core genome, the set of genes present in all strains, and the dispensable genome, the set of genes present in only some strains, including those unique to a single strain. This analysis also shows how many new genes can be expected from each newly sequenced genome. Several comparative genomic studies have revealed a catalog of genomic components and the evolutionary history of Streptomyces. However, even the most recent study analyzed only five model Streptomyces strains, so the analysis needs to be updated with current sequence information.
In this study, based on the rapidly increasing number of genomes sequenced, we performed a comprehensive analysis of the genomes of all 17 Streptomyces species that have been completely sequenced to date in order to understand their genomic components. We estimated the pan-genome of Streptomyces and identified the core genome that was conserved in all of the analyzed strains. In addition, ortholog clusters within the pan-genome were classified according to their functions, and genes that showed distinctive characteristics of Streptomyces were listed. This analysis provides up-to-date information on genomic diversity and core conservation of Streptomyces genomes, facilitating our comprehensive understanding of this genus.
Materials and Methods
- Nucleotide Sequence Accession Numbers
All the complete genome sequences of the 17 Streptomyces species used for our analysis were retrieved from the NCBI FTP site (
). The accession numbers for these 17
species are NC_021055 (
sp. PAMC26508), NC_015953 (
sp. SirexAA E), NC_020990 (
J1074), NC_003155 (
MA 4680), NC_016582 (
BCW1), NC_016111 (
NRRL 8057), NC_003888 (
A3(2)), NC_021985 (
Tü 365), NC_020504 (
JCM 4913), NC_016114 (
ATCC 33331), NC_021177 (
DSM 40593), NC_010572 (
NBRC 13350), NC_017765 (
S. hygroscopicus jinggangensis
5008), NC_022785 (
NRRL 5491), NC_013929 (
87 22), NC_018750 (
ATCC 10712), and NC_015957 (
- Pan-Genome Calculation
For the pan-genome computation of Streptomyces species, PGAP ver. 1.12 was used. Ortholog clusters were organized using the open reading frame (ORF) contents of each genome with the GF (Gene Family) method using default parameters (E-value: 1e-10; score: 40; identity: 50; coverage: 50). The pan-genome and core genome profiles were then built. Functional enrichment of ortholog clusters was performed using the PGAP program, and the clusters were classified according to clusters of orthologous groups (COG) categories. Subsequent classification work was performed using an in-house script.
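The in-house classification script is not described further, so the following is only a minimal sketch of how ortholog clusters could be sorted into core, dispensable, and unique groups once each cluster's genome membership is known. The input format, the function name `classify_clusters`, and the toy data are assumptions made for illustration; they are not taken from PGAP or from the authors' script.

```python
from collections import Counter


def classify_clusters(clusters, n_genomes):
    """Label each ortholog cluster as core, dispensable, or unique.

    clusters:  dict mapping a cluster ID to the set of genome names that
               carry at least one member gene (hypothetical input format).
    n_genomes: total number of genomes in the comparison.
    """
    labels = {}
    for cluster_id, genomes in clusters.items():
        count = len(genomes)
        if count == 0:
            raise ValueError(f"cluster {cluster_id} has no member genomes")
        if count == n_genomes:
            labels[cluster_id] = "core"         # present in every genome
        elif count == 1:
            labels[cluster_id] = "unique"       # present in a single genome
        else:
            labels[cluster_id] = "dispensable"  # present in some, not all
    return labels


# Toy example with three genomes (invented data, not from the study)
toy_clusters = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"C"},
    "c4": {"A", "B", "C"},
}
labels = classify_clusters(toy_clusters, n_genomes=3)
summary = Counter(labels.values())
print(summary)
```

With 17 genomes, the same rule would assign a cluster found in all 17 to the core genome, a cluster found in 2 to 16 genomes to the dispensable genome, and a cluster found in exactly one genome to the unique genome.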
Results and Discussion
- The Pan-Genome of 17 Streptomyces Species
Seventeen completely sequenced Streptomyces species genomes available at the NCBI FTP database (
) were used in this study. The genomic characteristics of each species are summarized in Supplementary Table S1. All strains are reported to contain linear chromosomes. Their genome sizes range from 6.3 to 12.7 Mb with G+C contents from 70.6% to 73.3%. The number of predicted coding sequences (CDSs; 5,832–10,022) was positively correlated with their genome size.
Pan-genome analysis of the 17 Streptomyces chromosomes revealed 34,592 ortholog clusters from 1,129,413 total genes that constituted the pan-genome. The size of the Streptomyces pan-genome may grow with the number of sequenced strains, and this pan-genome can therefore be considered an open pan-genome (
. This trend suggests that Streptomyces has flexible genome contents, reflecting the diversity of secondary metabolism and morphological differentiation, which is pronounced in this genus. The core genome consisted of 2,018 ortholog clusters (
B and Table S2). This number is smaller than that in a previous report, which described 3,096 gene families based on five Streptomyces genomes. The proportion of core genes in each species ranged from 24% to 38% and was negatively correlated with the number of ORFs. Although the number of core ortholog clusters may decrease further as more genomes are added, it is expected to converge to a constant value, as judged from the slope of the exponential decay. The number of dispensable gene families that were conserved in at least two but not all species was 11,743, and the number of ortholog clusters of unique genes that were present in only one strain was 20,831. We called these two groups accessory genomes; they are thought to contribute to the species’ diversity and generally provide functions that are not essential for viability. However, these genes may have conferred a selective advantage to Streptomycetes in their specific environmental niches.
Pan-genome analysis of Streptomyces. (A) Pan-genome and core genome profiles. The numbers of genes in the Streptomyces pan-genome and core genome are plotted against the number of genomes added. The deduced mathematical function is also reported. (B) Venn diagram showing the number of species-specific gene families in the genome of each species. The number of core gene families is shown in the center.
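The profiles in panel A can be illustrated with a short sketch that tracks how the pan-genome grows and the core genome shrinks as genomes are added. PGAP's own sampling procedure is not reproduced here; this version simply averages over random orderings of the genomes. The function name `accumulation_curves`, the permutation count, and the toy gene sets are assumptions for illustration only.

```python
import random


def accumulation_curves(gene_sets, n_permutations=100, seed=0):
    """Mean pan- and core-genome sizes after adding 1..N genomes.

    gene_sets: list of sets of ortholog cluster IDs, one set per genome
               (hypothetical input format).
    """
    rng = random.Random(seed)
    n = len(gene_sets)
    pan_totals = [0] * n
    core_totals = [0] * n
    for _ in range(n_permutations):
        order = list(gene_sets)
        rng.shuffle(order)            # random order of genome addition
        pan = set()
        core = set(order[0])
        for i, genes in enumerate(order):
            pan |= genes              # union grows or stays the same
            core &= genes             # intersection shrinks or stays the same
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_mean = [t / n_permutations for t in pan_totals]
    core_mean = [t / n_permutations for t in core_totals]
    return pan_mean, core_mean


# Toy example with three genomes (invented data, not from the study)
toy_genomes = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]
pan_mean, core_mean = accumulation_curves(toy_genomes)
print(pan_mean, core_mean)
```

A pan-genome curve that keeps rising as genomes are added corresponds to an open pan-genome, while a core-genome curve that flattens corresponds to the convergence discussed in the text.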
- Functional Distribution of Ortholog Clusters
Next, we examined the functional classifications of ortholog clusters using the COG database (Table S3). The most abundant COG category in the pan-genome, excluding poorly characterized or uncharacterized ones, was transcription (K), which included 1,945 gene families. The next most abundant COG categories were transport and metabolism of carbohydrates (G; 1,242) and amino acids (E; 1,046). In the core genome, the transcription category still encompassed the largest number of gene families (211), followed by metabolism of amino acids (192) and carbohydrates (130). The abundance of transcriptional regulators, including sigma factors, is a hallmark of Streptomycetes, consistent with their complex transcriptional regulatory networks that support morphological and physiological differentiation.
We then investigated the proportion of each conserved group (core, dispensable, and unique genomes) to determine the numbers of genes in each category (
). We found that the proportion of core gene families was high for the COG categories of translation (J) and nucleotide metabolism (F). This reveals the importance of protein and nucleic acid synthesis as conserved core functions and the relatively low diversity of genes within these categories. In comparison, gene families for secondary metabolism, defense mechanisms, carbohydrate transport and metabolism, and transcription occurred less frequently in the core genome. Even though the absolute number of gene families in these categories is large, the finding that most of them reside in the accessory genome suggests that they provide functions that increase the diversity and uniqueness of the Streptomycetes.
Distribution of orthologous genes based on COG category. The bars are sorted by the proportion of core genomes in each functional category.
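As a worked illustration of the core proportion used to sort the bars, the sketch below computes it for the three COG categories whose pan-genome and core genome counts are both quoted in the text. Only those six numbers come from the article; the variable and function names are illustrative, and the remaining categories would need the counts in Table S3.

```python
# Gene-family counts quoted in the text: (pan-genome, core genome)
cog_counts = {
    "K (transcription)": (1945, 211),
    "G (carbohydrate transport and metabolism)": (1242, 130),
    "E (amino acid transport and metabolism)": (1046, 192),
}


def core_fraction(pan, core):
    """Fraction of a category's gene families that belong to the core genome."""
    return core / pan


# Percentage of core gene families per category
fractions = {
    category: 100 * core_fraction(pan, core)
    for category, (pan, core) in cog_counts.items()
}

# Sort categories by descending core proportion, as in the figure
ranked = sorted(fractions, key=fractions.get, reverse=True)
for category in ranked:
    print(f"{category}: {fractions[category]:.1f}% core")
```

For these three categories the proportions come out at roughly 18.4% (E), 10.8% (K), and 10.5% (G), so amino acid metabolism is the most conserved of the three.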
- The Core Genome of Streptomycetes
We further investigated the core genome to understand the conserved basic biology of Streptomycetes. In general, Streptomyces species contain a linear chromosome, which has a “core region” that houses the relatively conserved housekeeping genes and two “arms” that contain more divergent and horizontally transferred genes. The terminal regions of the chromosomes are highly unstable, and unequal crossing-over between the two arms of the chromosome or between one arm of the chromosome and a linear plasmid occurs frequently, giving rise to gross rearrangements of the chromosome. The dynamic nature of the arms is consistent with their high genetic diversity. Accordingly, because essential genes occur infrequently in the terminal regions, a large part of these regions was deleted when a genome-minimized host for heterologous expression was constructed
. We confirmed that there was a high frequency of core genes at the central region of the chromosome in most species, consistent with prior knowledge (
Proportion of the core genome according to the location in linear chromosomes. All genomes were normalized to the same size and divided into 100 sections. The plot shows, for each section, the average fraction of the section length occupied by all genes and by core genes. Error bars indicate the standard deviation of this fraction in each section.
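The normalization described in this caption can be sketched as follows: each chromosome is divided into equal sections, and the fraction of each section covered by genes is computed. This is a minimal sketch under stated assumptions, not the authors' script. Gene coordinates are taken as 0-based, half-open intervals that do not overlap one another, and the function name `section_coverage` and the toy coordinates are invented for illustration.

```python
def section_coverage(genes, genome_length, n_sections=100):
    """Fraction of each chromosome section covered by the given genes.

    genes:         iterable of (start, end) pairs, 0-based and half-open,
                   assumed not to overlap one another.
    genome_length: chromosome length in base pairs.
    n_sections:    number of equal-sized sections (100 in the caption).
    """
    section_len = genome_length / n_sections
    covered = [0.0] * n_sections
    for start, end in genes:
        first = int(start // section_len)
        last = min(int((end - 1) // section_len), n_sections - 1)
        for s in range(first, last + 1):
            # Add the part of the gene that falls inside section s
            lo = max(start, s * section_len)
            hi = min(end, (s + 1) * section_len)
            covered[s] += max(0.0, hi - lo)
    return [c / section_len for c in covered]


# Toy chromosome of 1,000 bp split into 10 sections (invented data)
toy_genes = [(0, 50), (150, 250), (900, 1000)]
coverage = section_coverage(toy_genes, genome_length=1000, n_sections=10)
print(coverage)
```

Running the function once with all genes and once with core genes only, then averaging each section across the 17 genomes, would give the two series and the standard deviations that the caption describes.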
Next, we further examined several groups of core genes, using S. coelicolor as a reference strain. First, among the transcription-related genes that occupy 12% of the ORFs in the genome of S. coelicolor, we examined genes for sigma factors, which diversify gene expression patterns by altering the promoter specificity of RNA polymerase. The genome of S. coelicolor A3(2) is known to encode more than 60 different sigma factors
. The sigma factors in the Streptomyces core genome were assigned to 15 COG clusters. We found that 25 of the 65 sigma factors encoded in the S. coelicolor genome were included in these 15 core ortholog clusters (
). Genes for the major housekeeping sigma factor HrdB and its paralogs HrdA, HrdC, and HrdD were clustered in a single group (cluster ID 24). This cluster comprised 3–4 genes in each analyzed species, indicating that the multiplicity of these HrdB-like sigma factors is conserved in Streptomyces. Ortholog cluster 4 contained the largest number of sigma factor genes (
), many of which were reported to function in differentiation and response to osmotic and oxidative stresses
. Except for WhiG (cluster ID 1910), all the other 14 sigma factors are classified as group 4 or ECF-family sigma factors
. The conserved ECF sigma factors in all Streptomyces spp. include SigU, SigE, SigR, SigR1, SigT, BldN, and SigQ. Among these, some are known to be involved in differentiation and secondary metabolism (SigU, BldN, SigR, and SigT)
, cell wall function (SigE)
, and oxidative stress response (SigR and SigR1)
. Investigation of the other conserved sigma factors is needed to unravel the core functions governed by conserved alternative sigma factors.
Conserved genes in 17 Streptomyces species.
In addition, conservation of genes involved in the biosynthesis of mycothiol, the principal thiol compound found in many actinomycetes, was investigated. Mycothiol maintains a highly reducing environment within the cells and protects against disulfide stress. Four putative genes in this pathway,
(cluster ID 705),
(cluster ID 1092),
(cluster ID 953), and
(cluster ID 1601), were conserved in all the species that we analyzed, despite their scattered locations in the chromosome (
). This suggests that mycothiol acts as a common reducing agent in Streptomyces.
We further examined the conserved core genes that are annotated to be involved in secondary metabolism. More than 95% of the genes involved in secondary metabolism resided in the accessory (dispensable and unique) genomes of Streptomyces (
). This reflects the diversity of secondary metabolism of Streptomyces spp. Fewer than 5% of the genes for secondary metabolism are in the conserved core genome.
lists 27 genes for secondary metabolism that are conserved among 17 Streptomyces spp. They belong to seven of the 30 predicted COG clusters for secondary metabolism. Most or all of the genes in the 5-hydroxyectoine (4/4), siderophore (2/3), geosmin (1/1), and hopene (11/13) clusters were conserved throughout Streptomyces spp. 5-Hydroxyectoine is known to have an important role as a compatible solute in response to salt and heat stresses in the
. Geosmin, which is responsible for the odor of soil, is also likely to be produced in all of the strains examined in this work
. Hopene, a pentacyclic triterpene, can provide stability to bacterial membranes at high temperatures and under conditions of extreme acidity
. This study reveals that among secondary metabolites, only a handful of compounds such as hydroxyectoine, geosmin, and hopanoids are universally conserved among Streptomyces. More intensive investigation of the functions of these metabolites, whether characterized or uncharacterized, is needed to understand their roles in the biology of Streptomycetes.
Conserved genes involved in secondary metabolism in 17 Streptomyces species.
The amount of bacterial genomic information has been rapidly increasing with the development of high-throughput DNA sequencing technologies. In particular, the acquisition and understanding of the genome sequences of Streptomyces are important for drug discovery, because these organisms are an abundant source of secondary metabolites. In this study, we revealed the conservation of 2,018 and 32,574 gene families (COG clusters) within the core and accessory genomes, respectively, of 17 completely sequenced Streptomyces species using pan-genome analysis. Functional classification of ortholog clusters showed the distribution of ratios of core and accessory genomes. Furthermore, we investigated the functions of the conserved gene groups, which included the sigma factors, the mycothiol biosynthesis pathway, and secondary metabolic pathways. This analysis showed that Streptomyces species encode many common genes involved in stress response and morphological differentiation. Compared with previous reports, we could reduce the number of core genes by using more complete genomes. Despite the smaller number of core genes, we found that many genes and secondary metabolite clusters that respond to stress and external stimuli were still conserved. Therefore, we conclude that adaptation to and survival in various environments is one of the distinguishing characteristics of Streptomyces.
Elucidation of the core genome will provide insights into target selection for genome minimization during the construction of industrial strains or for metabolic engineering. Moreover, this analysis offers a basis for understanding the processes through which information from one strain is transferred to another strain. Integration of genomic information with other -omics studies, such as transcriptomics, proteomics, and metabolomics, will provide an opportunity for understanding more about the functional evolution of Streptomyces.
This work was supported by the Intelligent Synthetic Biology Center of the Global Frontier Project (2011-0031957) and the Basic Core Technology Development Program for the Oceans and the Polar Regions (2011-0021053) through the National Research Foundation of Korea (NRF), which is funded by the Ministry of Science, ICT, and Future Planning.