We present here a search for fungal sterol-esterase/lipase and bacterial depolymerase sequences in environmental metagenomes. Both enzyme types contain the α/β-hydrolase protein fold. Analysis of conserved DNA motifs, protein homology search, phylogenetic analysis, and protein 3D modeling were used, and the efficiency of these screening strategies is discussed. Bacterial genes were more abundant in the metagenomes than fungal genes, and the sequencing depth of the metagenomes seemed to be crucial for finding enough diversity of enzyme sequences. As a result, a novel putative PHA-depolymerase is described.
Model Sequences and Datasets
Model sequences of representative proteins from the enzyme families of interest were selected. Lipases Lip1, Lip2, and Lip3 from
, and sterol-esterases from
were chosen among the members of the C. rugosa-like family (accession numbers in
). All of them are well characterized at the genetic and biochemical levels. Analysis using MEME software (
) corroborated the presence of the conserved amino acid motifs GGGF and GESAG in the model proteins.
Phylogenetic analysis. (A) Phylogenetic tree of the putative sterol-esterases/lipases from the different metagenomes analyzed, together with the model sequences from the C. rugosa-like family and representatives from other families. The selected samples are indicated in brackets and came from soil metagenomes (Permafrost Bonanza Creek and Soil Miscanthus Kellogg), insect-associated habitats (Termite Fungus Garden and Atta Cephalotes Fungus Garden), and sea water (GS017 Coastal Caribbean and GS022 Open Ocean). (B) Phylogenetic tree of the putative PHA depolymerases, together with the bacterial model sequences and sequences from other families. The selected samples are indicated in brackets: candidate 2066792343 came from a sea water metagenome (Coastal North James Bay; MG-RAST ID 4441596) and 1092955233669 from an insect-associated habitat (Termite Fungus Garden; JGI Project ID 401007). The trees were built using MUSCLE for multiple sequence alignment and the Maximum-Likelihood method to calculate distances. Bootstrapping was performed with 1,000 replicates.
From the different PHA depolymerase families, mcl-PHA depolymerases from
were selected as models of intra- and extracellular enzymes containing the conserved amino acid motifs SWGGA and GISSG, respectively (accession numbers in
). As members of the new family of depolymerases from actinomycetes, mcl-PHA depolymerases containing the conserved amino acid motif GHSQGG from
SO1 were selected (accession numbers in
). The genetic and biochemical traits of these enzymes are also well known.
Eighty-one nucleotide datasets from different metagenomes were downloaded from the servers of the MG-RAST, IMG/M, and CAMERA databases. The metagenomes were selected from diverse environments to maximize genetic variability. Using these metagenomes, more than 7 million assembled contigs were analyzed (Table S1).
- Metagenome Screening
We used a two-step process to search public metagenomes for genes encoding the proteins of interest. The first step was a sequence similarity screening carried out with two different strategies. In the first strategy, the DNA sequences in the 81 metagenomes, translated in the six possible reading frames, were compared by means of BLASTx (e-value of 10
) against two custom databases, one for each target protein family. An esterase/lipase-specific database was built with a total of 4,805 protein sequences from the NCBI non-redundant (NR) database whose names contained the terms “esterase” or “lipase.” For the depolymerases, a custom database was built with a total of 1,275 sequences from the NCBI NR database whose names contained the term “depolymerase,” plus the sequences of the mcl-PHA depolymerases present in the “PHA depolymerase Engineering Database” (DED).
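The name-based selection used to assemble the custom databases can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `parse_fasta` and `filter_by_keywords` are hypothetical, the input is assumed to be a FASTA export of protein records, and the matching is a plain case-insensitive substring test on the header line, which is not necessarily how the original databases were built.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_keywords(records: Iterable[Record],
                       keywords: Sequence[str]) -> Iterator[Record]:
    """Keep records whose header contains any keyword (case-insensitive).

    Assumption: a substring match on the header is enough for illustration.
    """
    lowered = [k.lower() for k in keywords]
    for header, sequence in records:
        text = header.lower()
        if any(k in text for k in lowered):
            yield header, sequence


if __name__ == "__main__":
    example = """>sp1 triacylglycerol lipase
MKLAAA
>sp2 DNA polymerase
MTT
>sp3 Sterol Esterase
MGG
"""
    for header, seq in filter_by_keywords(parse_fasta(example.splitlines()),
                                          ("esterase", "lipase")):
        print(header, len(seq))
```

A substring match of this kind is deliberately loose: it would also keep entries such as phospholipases, which is one reason why the later motif and length filters are needed.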
The second strategy used to look for candidates was based on the translation in the six possible reading frames of the nucleotide sequences from the 81 metagenomes and their use as the reference database (40 Gb size). The model sequences described in the Model Sequences and Datasets section were compared against this database using BLASTp with an e-value of 10
to identify sequences highly related to the reference model proteins.
In all cases, the sequences giving a positive hit in the different BLAST comparisons were checked for the presence of introns, and the nucleotide sequences were translated to protein in the correct reading frame and written to separate FASTA files.
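Translation in the six possible reading frames, used in both strategies, can be sketched in a few lines of Python. This is a minimal illustration and not the tool used in the study: it assumes the standard genetic code, writes stop codons as `*` and codons with ambiguous bases as `X`, and does not attempt intron detection. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; trailing incomplete codons are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return the six translations keyed as +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGGTGGAGGATTT").items():
        print(name, protein)
```

In a screening setting, each frame would then be split at stop codons and the resulting segments searched for the motifs of interest.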
To carry out the second step in the screening for candidate sequences, each individual file with the selected translated sequences was filtered using the BioEdit 7.1.3 software to check for the conserved motifs from the model sequences. After that, sequences shorter than 200 amino acids for PHA depolymerases and 500 amino acids for sterol-esterase/lipases were removed because they were too short to contain a complete ORF. In the case of the sterol-esterase/lipases, sequences with more than 100 amino acids between the motifs GGGF and GESAG were also discarded.
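The motif and length criteria of this second step can be expressed as a short Python sketch. It is an illustration only, since the filtering in the study was done with BioEdit. The function names are hypothetical, and two details are assumptions: the motif distance is measured between the start positions of the first occurrence of each motif, and a depolymerase candidate needs at least one of the three motifs SWGGA, GISSG, or GHSQGG.

```python
def is_sterol_esterase_candidate(protein: str,
                                 min_length: int = 500,
                                 max_motif_distance: int = 100) -> bool:
    """Require both GGGF and GESAG, a minimum length, and nearby motifs.

    Assumption: distance = difference between the start positions of the
    first occurrence of each motif.
    """
    if len(protein) < min_length:
        return False
    first = protein.find("GGGF")
    second = protein.find("GESAG")
    if first < 0 or second < 0:
        return False
    return abs(second - first) <= max_motif_distance


def is_depolymerase_candidate(protein: str,
                              min_length: int = 200,
                              motifs=("SWGGA", "GISSG", "GHSQGG")) -> bool:
    """Require a minimum length and at least one depolymerase motif."""
    return len(protein) >= min_length and any(m in protein for m in motifs)


if __name__ == "__main__":
    # Synthetic sequences built only to exercise the filters.
    good = "M" * 100 + "GGGF" + "A" * 80 + "GESAG" + "A" * 320
    far = "M" * 100 + "GGGF" + "A" * 150 + "GESAG" + "A" * 300
    print(is_sterol_esterase_candidate(good))
    print(is_sterol_esterase_candidate(far))
    print(is_depolymerase_candidate("M" * 150 + "SWGGA" + "A" * 60))
```

Keeping the thresholds as parameters makes it easy to test how sensitive the number of surviving candidates is to the chosen cut-offs.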
- Phylogenetic Analysis
A phylogenetic tree was built for each kind of enzyme using the representative model sequences and the putative sequences selected after metagenome screening. An unrooted tree with bootstrap support (1,000 replicates) was built with the MEGA 5.1 software (default parameters), using MUSCLE for multiple sequence alignment and the Maximum-Likelihood method to calculate distances. Sequences grouping with the versatile lipases and sterol-esterases from the C. rugosa-like family, the
GK13, or actinomycetes-like mcl-PHA depolymerases, were selected for further analyses.
- Sequence Analysis and Structural Models
The sequences of the sterol-esterase/lipase candidates selected in the previous section were compared against the NCBI NR database using BLASTp. The sequences of the PHA depolymerase candidates selected in the previous section were compared against the NCBI NR and “DED” databases using BLASTp.
A three-dimensional model of each selected putative protein was generated using the programs implemented in the protein homology-modeling server SWISS-MODEL (Swiss Institute of Bioinformatics). The template with the highest similarity, automatically chosen by the SWISS-MODEL server for sequence 1092955233669, was the esterase 3IA2 from
(QMEAN Z-score 3.3112), for 2066792343 the hydrolase 3OM8 from
PAO1 (QMEAN Z-score 3.033), and for the PHA depolymerase from
KT2440 the hydrolytic enzyme PA3053 from
PAO1 (4F0J) (QMEAN Z-score 5.214). Since the crystal structure of the PHA depolymerase from
KT2440 is still unknown, all proteins were modeled using the same template, the hydrolytic enzyme PA3053 from
PAO1 (4F0J). The models were analyzed in detail using PyMOL 1.1, and putative intramolecular tunnels were modeled from the catalytic serine using Caver 2.0 ver. 0.003.
- Search in Public Microbial Genomes
The search for putative enzymes was carried out using different strategies. A comparison of the metagenome sequences with those in the custom esterase/lipase database rendered 14,237 candidates, but only 38 of them contained the two conserved motifs. After filtering the sequences, only 11 candidates remained. These candidates were translated in the correct reading frame, and no introns were detected. In the case of the depolymerases, the comparison rendered 110,792 candidates. After filtering, 10 sequences containing the SWGGA conserved motif remained, and only six of them were non-redundant. The GISSG motif was not present in any sequence, and the GHSQGG motif was found in four sequences.
Number of candidates remaining at each step of the screening.
The search criteria used are indicated in parentheses.
BLAST comparison of the C. rugosa-like family model sequences against the combined metagenome database did not render any hit. The model PHA depolymerase sequences provided four hits containing the conserved motif from
KT2442, which were redundant with those mentioned above.
Differences in the results obtained with the different strategies are due to the stringency level used, the second approach being more restrictive and the first approach more prone to render false positives.
The differences in the number of positive matches for the different groups of enzymes can be attributed to the low abundance of eukaryotic DNA in the environment. In addition, the presence of introns in some eukaryotic genes can be a drawback for the similarity search strategies that we used.
members are among the most ubiquitous bacteria in the environment
. The fact that several sequences from
were repeated in different metagenomes, and the low abundance of sequences from actinomycetes and fungi, reflect the low sequencing depth of the different metagenomes. A much higher number of sequences seems to be crucial for finding less abundant or rare sequences, such as those of enzymes from secondary metabolism.
- Phylogenetic Analysis
In the case of fungal sterol-esterases/lipases (
A), 8 out of 11 candidates grouped with the carboxyl esterases from
. This is a group of esterases (a/bH1.05) closely related to the lipases from
, and also contains the motifs GGGF and GESAG.
is a very common genus in many habitats
. Three of the candidates grouped with the lipases from the
Brefeldin A family
; however, these enzymes have not been reported to hydrolyze sterol esters.
B shows how candidates 2066792343, 2042085485, and 1092955233669, containing the SWGGA motif, showed high similarity to the PHA depolymerase from
KT2440. Sequence 2208943353, which grouped with the sequences from actinomycetes, showed no homology with any PHA depolymerase after BLAST comparison, but it was similar to α/β hydrolases from Dietzia, Rhodococcus, and Kribbella. Three sequences grouped with GK13, but they did not contain the GISSG motif, and three more did not group with any model sequence.
The candidates 2066792343 and 1092955233669 gave positive matches with both strategies and were selected for further analyses.
As a general consideration, our sequence-based screening methods rely on similarity to known sequences or to conserved motifs. Therefore, this approach cannot detect genes with novel sequences. Nevertheless, unlike function-based methods, it allows genes to be screened without the need for heterologous gene expression and correct protein folding in the selected host.
- Molecular Modeling Analysis of the Selected Candidates
The two putative PHA depolymerase sequences were compared against the NCBI database using BLASTp. Sequence 1092955233669 had 99% identity with a predicted poly(3-hydroxyalkanoate) depolymerase from the genome of
WH6, presenting three amino acid substitutions. Sequence 2066792343 displayed 98% identity with a predicted poly(3-hydroxyalkanoate) depolymerase from the genome of
, with 16 amino acid substitutions.
Molecular 3D models of the candidates and the model sequence were generated using the same template (
). The model from
KT2440 showed a typical esterase structure, with the catalytic Ser close to the mouth of the catalytic pocket. The prediction of the internal tunnels showed a big channel coincident with the substrate binding pocket (
A). Candidate 2066792343 displayed a very similar model structure (not shown), but in the case of 1092955233669 the internal tunnel looked wider (
B). Small differences in the substrate-binding pocket could affect the specificity of the enzyme, allowing it to accept bigger substrates, or its catalytic efficiency.
Tridimensional structure models of the proteins. (A) 747106_KT2440_i-nPHAmcl. (B) Depolymerase 1092955233669, generated using the SWISS-MODEL server. Internal tunnels in each structure were modeled using Caver 2.0.
In conclusion, we present here a useful sequential strategy to identify enzymes of potential biotechnological interest by means of metagenome mining. The search for eukaryotic genes may be hampered mainly by the presence of introns and by the predominance of prokaryotic DNA in environmental samples. After screening more than 7,000,000 sequences, two putative PHA depolymerases from Pseudomonas were selected. One of them presented specific features that may confer different properties.
This work was supported by the Spanish projects BIO2009-0844, BIO2012-36372, and S-2009AMB-1480. J. Barriuso is thankful for the financial support from the JAE-DOC CSIC program. We thank the CIB-CSIC bioinformatics facility personnel.
MEME SUITE: tools for motif discovery and searching.
Nucleic Acids Res.
DOI : 10.1093/nar/gkp335
Ramos Solano B
Gutierrez Mañero FJ
Effect of inoculation with putative PGPR isolated from Pinus sp. on Pinus pinea growth, mycorrhization and rhizosphere microbial communities.
J. Appl. Microbiol.
DOI : 10.1111/j.1365-2672.2008.03862.x
Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows.
DOI : 10.1186/1471-2105-12-473
Fungal genomes mining to discover novel sterol esterases and lipases as catalysts.
DOI : 10.1186/1471-2164-14-712
Identification of novel positive-strand RNA viruses by metagenomic analysis of archaea-dominated Yellowstone hot springs.
DOI : 10.1128/JVI.07196-11
Production, isolation and characterization of a sterol esterase from Ophiostoma piceae.
BBA Proteins Proteomics
DOI : 10.1016/S1570-9639(02)00378-3
HMMER web server: interactive sequence similarity searching.
Nucleic Acids Res.
DOI : 10.1093/nar/gkr367
de la Mata I
Characterization of a novel subgroup of extracellular medium-chain-length polyhydroxyalkanoate depolymerases from actinobacteria.
Appl. Environ. Microbiol.
DOI : 10.1128/AEM.01707-12
Cloning and expression of the polyhydroxyalkanoate depolymerase gene from Pseudomonas putida, and characterization of the gene product.
DOI : 10.1023/B:BILE.0000045657.93818.18
The SWISS-MODEL repository and associated resources.
Nucleic Acids Res.
DOI : 10.1093/nar/gkn750
Purification and characterisation of a novel steryl esterase from Melanocarpus albomyces.
Enzyme Microb. Technol.
DOI : 10.1016/j.enzmictec.2005.10.013
Variability within the Candida rugosa lipases family.
DOI : 10.1093/protein/7.4.531
Structural insights into the lipase/esterase behavior in the Candida rugosa lipases family: crystal structure of the lipase 2 isoenzyme at 1.97 Å resolution.
J. Mol. Biol.
DOI : 10.1016/j.jmb.2003.08.005
IMG/M: the integrated metagenome data management and comparative analysis system.
Nucleic Acids Res.
DOI : 10.1093/nar/gkr975
Computation of tunnels in protein molecules using Delaunay triangulation.
The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.
DOI : 10.1186/1471-2105-9-386
EMBOSS opens up sequence analysis. European molecular biology open software suite.
DOI : 10.1093/bib/3.1.87
Lipase engineering database: understanding and exploiting sequence-structure-function relationships.
J. Mol. Catal. B Enzym.
DOI : 10.1016/S1381-1177(00)00092-8
Metagenomics, biotechnology with non-culturable microbes.
Appl. Microbiol. Biotechnol.
DOI : 10.1007/s00253-007-0945-5
Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource.
Nucleic Acids Res.
DOI : 10.1093/nar/gkq1102
Prospecting for novel biocatalysts in a soil metagenome.
Appl. Environ. Microbiol.
DOI : 10.1128/AEM.69.10.6235-6242.2003
Crystal structure of brefeldin A esterase, a bacterial homolog of the mammalian hormone-sensitive lipase.
Nat. Struct. Biol.
DOI : 10.1038/7576