ProteoSeeker – Skretas Lab

About ProteoSeeker

ProteoSeeker is a feature-rich metagenomic analysis tool for accessible and comprehensive metagenomic exploration designed for non-expert users. It allows for the identification of novel proteins that belong to user-defined protein families in metagenomic datasets, while performing taxonomy analysis.

Sampling Site Documentation: Specific characteristics of the sample’s environmental source, including factors such as location, habitat, sampling conditions and collection method are documented.
Sample Collection: The metagenomic material is collected from the environmental niche of interest.
DNA Isolation and Preparation: Following DNA extraction, the metagenomic material is prepped for sequencing.
Next-Generation Sequencing (NGS): NGS is performed to generate the dataset(s) with the reads derived from the sample.
NGS Data Processing: Datasets and metadata are shared in open-access databases, facilitating collaborative research and data reuse. Such files can be provided directly to ProteoSeeker for analysis, forming the exploration ground for the tool. Users may download these datasets for input into ProteoSeeker or, for samples from the NCBI’s SRA database, simply input the SRA accession number directly. For example, ProteoSeeker may be utilized for the discovery of novel proteins/enzymes originating from environments of interest, thereby enhancing the scientific community’s ability to explore microbial ecosystems.
ProteoSeeker Analysis: The selected dataset(s) or SRA accession are submitted to ProteoSeeker. The tool identifies putative proteins derived from the reads.
Functional Analysis: ProteoSeeker offers two core functionalities through its “seek” and “taxonomy” modes: discovering proteins/enzymes from specific protein families and performing taxonomic assignment of the identified proteins, respectively.
Protein Family Profiling: Profile Hidden Markov Models (pHMMs) of protein families from the Pfam database form the basis for discovering novel proteins/enzymes that belong to specific protein families and are therefore expected to have specific functions.
Taxonomic Assignment: ProteoSeeker also supports the assignment of one or more taxa to identified proteins, aiding in the understanding of microbial community composition.

Pipeline

ProteoSeeker offers two main functionalities, applied through the “seek” mode and the “taxonomy” mode, with a multitude of options that cater to both beginners and advanced users. For those unfamiliar with metagenomic analysis tools, ProteoSeeker provides pre-defined options, while more experienced users can modify the behavior of specific tools within the pipeline. ProteoSeeker accepts as input an SRA code, reads in FASTQ files, or contigs, genomes or proteins in FASTA format. If an SRA code is provided, the corresponding SRA file is downloaded and the FASTQ file(s) are generated from it.

Seek mode

The seek mode identifies proteins that may belong to selected protein families. The seek mode offers three types of analysis: “Type 1”, “Type 2” and “Type 3”.

The steps of the seek mode of ProteoSeeker are described below based on the figure above. Each stage in the figure is colored based on the mode it belongs to (blue for the seek mode and green for the taxonomy mode).

The selected “seek” protein families are determined based on their input codes and their profiles are collected. Type 1, 2, 3 analysis.
The “seek profile database” (spd) is created. Type 1, 2, 3 analysis.
The “seek” protein names associated with the selected families are collected. Type 2, 3 analysis.
The protein database is filtered based on the collected protein names and the “seek filtered protein database” (sfpd) is created. Type 2, 3 analysis.
The reads in the FASTQ files undergo several quality control checks by FastQC. Type 1, 2, 3 analysis.
The reads are preprocessed by BBDuk and reanalyzed by FastQC. Type 1, 2, 3 analysis.
The preprocessed reads are assembled into contigs by Megahit. Type 1, 2, 3 analysis.
Protein coding regions (pcdrs) are predicted in the contigs by FragGeneScanRs. Type 1, 2, 3 analysis.
CD-HIT is used to reduce the redundancy of the pcdrs. Type 1, 2, 3 analysis.
The pcdrs are screened against the spd through HMMER. Any pcdr with at least one hit from this screening is retained (set 1). Type 1, 2, 3 analysis.
The rest of the pcdrs are screened against the sfpd through DIAMOND BLASTP, retaining only those with at least one hit with an e-value lower than the threshold (set 2). In addition, set 1, if not empty, is screened against the Swiss-Prot protein database through DIAMOND BLASTP. Type 2, 3 analysis.
Both sets are screened against all the profiles of the Pfam database through HMMER. Type 1, 2, 3 analysis.
Topology prediction is performed by Phobius. Type 1, 2, 3 analysis.
Motifs provided by the user are screened against each protein. Type 1, 2, 3 analysis.
The protein family of each protein is predicted. Type 1, 2, 3 analysis.
Annotation files are written. Type 1, 2, 3 analysis.
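
The per-step labels above imply which analysis types depend on a protein database: the steps that build and query the sfpd are marked for Type 2 and Type 3 only. A minimal shell sketch of that lookup follows. The function name `seek_needs_protein_db` is hypothetical and is not part of ProteoSeeker; it only restates the labels in the list.

```shell
# Hypothetical helper (not part of ProteoSeeker): report whether a seek
# analysis type needs a user-supplied protein database, following the
# per-step "Type ..." labels in the list above.
seek_needs_protein_db() {
    case "$1" in
        1)   echo "no"  ;;   # no step labelled for Type 1 touches the sfpd
        2|3) echo "yes" ;;   # the sfpd is built and queried with DIAMOND BLASTP
        *)   echo "unknown analysis type: $1" >&2; return 1 ;;
    esac
}

for t in 1 2 3; do
    echo "Type $t needs protein database: $(seek_needs_protein_db "$t")"
done
```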

Taxonomy mode

The taxonomy mode performs taxonomic classification of the proteins discovered through sample analysis. It can be applied through either of two routes: the “Kraken2 route”, which is based on the taxonomic classification of the reads by Kraken2, or the “COMEBin/MetaBinner route”, which is based on binning the contigs with COMEBin or MetaBinner and searching for the taxonomy of the proteins through the “taxonomy filtered protein database” (tfpd).

The steps of the taxonomy mode of ProteoSeeker are described below based on the figure above. Each stage is colored based on the mode and taxonomy route it belongs to (blue for the seek mode, green for the taxonomy mode, orange for the Kraken2 route and purple for the COMEBin/MetaBinner route).

Common stages:

The SRA file is downloaded and converted to FASTQ files.
The reads of the FASTQ files undergo several quality control checks by FastQC.
The reads are preprocessed by BBDuk and reanalyzed by FastQC.
The preprocessed reads are assembled into contigs by Megahit.
Protein coding regions (pcdrs) are predicted in the contigs by FragGeneScanRs.
CD-HIT is used to reduce the redundancy of the pcdrs.
Bowtie2 maps the reads to the contigs.
Annotation files are generated.

Kraken2 route stages:

Species are assigned to the reads based on Kraken2. Bracken then provides the abundances of these species.
Through the read-contig mapping, each species is quantified for each contig. Species are assigned to the contigs.
The contigs are binned based on their species.
Species are assigned to the bins.
Species are assigned to the genes and proteins of the bins.

COMEBin/MetaBinner route stages:

The selected “taxonomy” protein families are determined based on their input codes and their profiles are collected.
The “taxonomy profile database” (tpd) is created.
The “taxonomy” protein names associated with the selected families are collected.
The protein database is filtered based on the collected protein names and the “taxonomy filtered protein database” (tfpd) is created.
The contigs are binned based on COMEBin or MetaBinner.
The pcdrs are screened against the tpd through HMMER.
Any pcdr with at least one hit against the tpd is screened against the tfpd through DIAMOND BLASTP.
Taxon names are converted to TaxIds, and the latter are used to query taxonomic lineages, based on TaxonKit. Taxa are assigned to the bins and to their genes and proteins.
Each bin, along with any taxa assigned to it, is quantified based on the reads mapped to its contigs.

Installation

Pipeline tools

All the tools are installed automatically by the installation process of ProteoSeeker or are already set up in the Docker image of ProteoSeeker. The tool versions included in ProteoSeeker’s installation are the ones used for the evaluation of its “seek” and “taxonomy” modes. For some packages more than one installation method is provided; if the first method fails, the next one is attempted. The versions of conda and of these tools are the following:

conda: 24.1.2
bbmap: 39.01
bowtie2: 2.5.3
cd-hit: 4.8.1
comebin:
1. Conda: 1.0.4 – Used for the evaluation.
2. Source: Branch: “1.0.4”.
diamond: 2.1.9
fastqc: 0.12.1
hmmer: 3.4
kraken2:
1. Conda: 2.1.3 – Used for the evaluation.
2. Source: Branch: “v2.1.3”
megahit: 1.2.9
metabinner:
1. Source: Branch: “master”, Hash: “50a1281e8200d705a744736f23efe53c6048bbe8” – Used for the evaluation.
2. Conda: 1.4.4
sra-tools: 3.1.0
taxonkit: 0.16.0
csvtk: 0.30.0
FragGeneScanRs: 1.1.0

Phobius

To use the topology and signal peptide predictions provided by Phobius, you must download Phobius from https://phobius.sbc.su.se/data.html. Otherwise, ProteoSeeker will run its seek functionality without performing topology and signal peptide predictions.

Databases

The latest versions of databases 1-4 in the list below are installed automatically by ProteoSeeker. Only the protein database (5) must be installed by the user. Make sure that the system has enough available disk space to hold the decompressed nr database, which is approximately 400 GB.

1. Pfam database: Latest – Automatic installation
2. Swiss-Prot/UniprotKB database: Latest – Automatic installation
3. GTDB taxonomy taxdump files: Latest – Automatic installation
4. Kraken 2/Bracken Refseq indexes: Collection Standard-8: Latest – Automatic installation
5. nr database: Latest – Non-Automatic installation
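
Because the decompressed nr database needs roughly 400 GB, it can help to check the available space before starting the download. The sketch below is one way to do that with the standard `df` utility. The function `has_free_space` and the variable `DB_DIR` are illustrative assumptions, not part of ProteoSeeker, and the check counts in GiB (1024-based), so treat it as an approximation.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEED_GB gigabytes available. Uses POSIX "df -Pk" (1024-byte blocks).
has_free_space() {
    dir=$1
    need_gb=$2
    # Line 2, column 4 of "df -Pk" output is the available space in KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# DB_DIR is an assumed location for the nr database; change it as needed.
DB_DIR=${DB_DIR:-${HOME:-/}}
if has_free_space "$DB_DIR" 400; then
    echo "Enough space for the decompressed nr database in $DB_DIR"
else
    echo "Less than 400 GB available in $DB_DIR"
fi
```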

Installation Methods

For detailed installation instructions please visit the ProteoSeeker Repository.

We suggest running ProteoSeeker in a Docker container created from its image, rather than directly through the command-line, when possible. To do so, install Docker and download the Docker image of ProteoSeeker. Running ProteoSeeker directly through the command-line is necessary to perform the tests described in the evaluation section. It is also preferable when the same SRA sample needs to be analyzed multiple times, because a command-line run retains the SRA file after it is downloaded and processed, so it does not have to be downloaded and processed again in future runs.

Docker

To install ProteoSeeker from Docker Hub as a Docker image, Docker must be installed on your system. To install Docker on Ubuntu, follow the instructions at the link below:

Docker engine for Ubuntu: https://docs.docker.com/engine/install/ubuntu/

Then, download the image of ProteoSeeker from Docker Hub. There are two versions. The “main_v1.0.0” version contains the “Kraken 2/Bracken Refseq indexes Collection Standard-8 database” while the “light_v1.0.0” version does not. Hence, the main_v1.0.0 version can be used directly to run the seek or the taxonomy mode of ProteoSeeker, specifically through the Kraken2 route. The light_v1.0.0 version can be used directly to run only the seek mode of ProteoSeeker. Neither version contains a protein database. The process of using a protein database through Docker is described below. Both versions can be modified to utilize a protein database and thus be used to run the seek mode type 2 or 3 analysis and the taxonomy mode through the COMEBin/MetaBinner route of ProteoSeeker.

The main_v1.0.0 version has a download size of 12.92 GB and decompressed has a size of 29.9 GB. To install the main_v1.0.0 version use one of the following commands:

sudo docker image pull skretaslab/proteoseeker

sudo docker image pull skretaslab/proteoseeker:latest

sudo docker image pull skretaslab/proteoseeker:main_v1.0.0

The light_v1.0.0 version has a download size of 7.42 GB and decompressed has a size of 21.8 GB. To install the light_v1.0.0 version use the following command:

sudo docker image pull skretaslab/proteoseeker:light_v1.0.0
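
The choice between the two images follows from whether the Kraken2 database needs to be bundled. The sketch below encodes that choice. `pick_image` is a hypothetical helper, not something shipped with ProteoSeeker, and it prints the pull command rather than running it.

```shell
# Hypothetical helper: print the image reference suited to a planned run.
# Argument: "kraken2" if the Kraken2 taxonomy route should work out of the
# box; anything else (e.g. "seek") if only the seek mode is needed.
pick_image() {
    case "$1" in
        kraken2) echo "skretaslab/proteoseeker:main_v1.0.0"  ;;  # bundles the Standard-8 indexes
        *)       echo "skretaslab/proteoseeker:light_v1.0.0" ;;  # smaller, no Kraken2 database
    esac
}

# Print the pull command instead of running it, so the sketch has no side effects.
echo "sudo docker image pull $(pick_image kraken2)"
```

After a pull, `sudo docker image ls skretaslab/proteoseeker` lists the tags that are present locally.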

GitHub

Prerequisites

Anaconda

To install ProteoSeeker from source code, conda (from Anaconda) must be installed and activated on your system. Instructions for installing Anaconda on Linux are available at the following link:

https://docs.anaconda.com/free/anaconda/install/linux/

Git

Git is necessary to download the ProteoSeeker repository.

Dependencies

All dependencies, except for the protein database, are automatically installed by the installation process of ProteoSeeker.

Run ProteoSeeker

Parameter file

In general, the easiest and suggested way to run ProteoSeeker is with a parameter file. A parameter file should contain, at least, the parameters for the options of ProteoSeeker that are to be changed from their default values. The installation of ProteoSeeker provides parameter files for different scenarios, for use in a Docker container or directly through the command-line. The “template” parameter files contain all options. In addition, we advise that the paths used as input to ProteoSeeker (for files or databases) are absolute rather than relative and contain no whitespace, although ProteoSeeker is designed to handle these cases.
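
The path advice above can be checked before a path is written into a parameter file. The following is a minimal sketch. `is_safe_path` is a hypothetical pre-flight helper and the example paths are invented for illustration.

```shell
# Hypothetical pre-flight check: succeed only for absolute paths that
# contain no spaces or tabs.
is_safe_path() {
    tab=$(printf '\t')
    case "$1" in
        /*) ;;                 # absolute path: keep checking
        *)  return 1 ;;        # relative path
    esac
    case "$1" in
        *" "*|*"$tab"*) return 1 ;;   # contains whitespace
    esac
    return 0
}

# Example paths are illustrative only.
for p in "/data/nr_db/nr" "results/run 1"; do
    if is_safe_path "$p"; then
        echo "ok:      $p"
    else
        echo "rewrite: $p"
    fi
done
```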

Options

The options of ProteoSeeker as a command-line tool, together with their default values and descriptions, are documented in ProteoSeeker’s GitHub repository: https://github.com/SkretasLab/ProteoSeeker/blob/main/README.md#32-options

Run with Docker

ProteoSeeker can run in a container created from its image, using either a bind mount or a volume. The images were created and the containers tested with Docker version “27.1.1”. The bind mount or volume is primarily used to provide input files, parameter files, databases, an output directory and the Phobius installation from the host system to the container. We advise using a bind mount over a volume, because it has fewer requirements for granting the privileges needed to access the shared files.

In both cases (bind mount and volume), the protein database provided as an example is a small part of the nr database with additional proteins associated with RNA polymerases. It is used to test that the seek mode through type 2 analysis and the taxonomy mode through the COMEBin/MetaBinner route function properly. To use these properly, you should provide your own protein database, ideally the decompressed nr database. No other analysis type or route of ProteoSeeker requires the protein database.

ProteoSeeker will detect and use Phobius for topology prediction if Phobius is installed in the proper directory (“phobius”) of the shared directory (bind mount or volume); otherwise no topology predictions are made for the proteins.

For both bind mounts and volumes, you can perform a test based on one of the template parameter files (located in the “parameter_files” directory of the bind mount or volume), each of which runs a different analysis. The test is selected by number through one of the scripts “docker_bindmount_run_proteoseeker.sh” and “docker_vol_run_proteoseeker.sh”. All template parameter files are ready to run ProteoSeeker on the sample with the SRA code “SRR12829170”. Each parameter file is also set up to handle FASTQ paired-end input. Furthermore, the light_v1.0.0 Docker image can run a container for the Kraken2 taxonomy route only if you provide a path to a Kraken2 database in the shared directory (bind mount or volume). The selections are described below:

Selection	Mode	Analysis Type	Route	Input
1	seek & taxonomy	type 3	Kraken2	SRA or FASTQ paired-end
2	seek & taxonomy	type 3	COMEBin/MetaBinner: MetaBinner	SRA or FASTQ paired-end
3	seek & taxonomy	type 3	COMEBin/MetaBinner: COMEBin	SRA or FASTQ paired-end
4	seek	type 3	–	SRA or FASTQ paired-end
5	taxonomy	–	Kraken2	SRA or FASTQ paired-end
6	taxonomy	–	COMEBin/MetaBinner: MetaBinner	SRA or FASTQ paired-end
7	taxonomy	–	COMEBin/MetaBinner: COMEBin	SRA or FASTQ paired-end
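
When scripting around the run scripts, the table above can be mirrored as a small lookup to validate a selection number before it is passed on. This is only a sketch. `describe_selection` is a hypothetical function that restates the table rows and does not call ProteoSeeker.

```shell
# Hypothetical lookup mirroring the selection table above: prints
# "mode|analysis type|route" for a selection number (1-7).
describe_selection() {
    case "$1" in
        1) echo "seek & taxonomy|type 3|Kraken2" ;;
        2) echo "seek & taxonomy|type 3|MetaBinner" ;;
        3) echo "seek & taxonomy|type 3|COMEBin" ;;
        4) echo "seek|type 3|-" ;;
        5) echo "taxonomy|-|Kraken2" ;;
        6) echo "taxonomy|-|MetaBinner" ;;
        7) echo "taxonomy|-|COMEBin" ;;
        *) echo "selection must be 1-7" >&2; return 1 ;;
    esac
}

describe_selection 5
```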

Setup details for the Docker bind mount & volume can be found at:

https://github.com/SkretasLab/ProteoSeeker/blob/main/README.md#33-docker

Run from the command-line

To run ProteoSeeker through the command-line, a parameter file facilitates the process greatly. By using one of the template parameter files, you can easily customize the values of ProteoSeeker’s options and run it. Before running ProteoSeeker, activate its environment. To run the seek mode with type 2 or 3 analysis, or the taxonomy mode through the COMEBin/MetaBinner route, set the path to the protein database in the parameter file or provide it as a parameter. Certain parameter files run ProteoSeeker in the seek mode with type 1 analysis or in the taxonomy mode through the Kraken2 route without any modification to the parameter file or any additional parameter.

The table below links the template parameter files with the mode and analysis type or route applied by ProteoSeeker in the run. All template parameter files are ready to run ProteoSeeker on the sample with the SRA code “SRR12829170”. Each parameter file is also set up to handle either FASTQ paired-end input or FASTA contig(s)/genome(s) input, provided that the SRA code is removed from the parameter file. The template parameter files 1, 2, 3, 4 and 5 can be used directly to run ProteoSeeker without modifications, as they do not require a protein database. Parameter file 6 needs a protein database and certain modifications before it can be used by ProteoSeeker.

Index	Parameter File	Mode	Analysis Type	Taxonomy Route	Input
1	par_seek_p.txt	seek	type 1	–	SRA or FASTQ paired-end
2	par_seek_c.txt	seek	type 1	–	SRA or FASTA contigs/genome(s)
3	par_seek_tax_k_p.txt	seek & taxonomy	type 1	Kraken2	SRA or FASTQ paired-end
4	par_seek_tax_mc_p.txt	seek & taxonomy	type 1	COMEBin/MetaBinner: MetaBinner	SRA or FASTQ paired-end
5	par_tax_k_p.txt	taxonomy	–	Kraken2	SRA or FASTQ paired-end
6	par_tax_mc_p.txt	taxonomy	–	COMEBin/MetaBinner: MetaBinner	SRA or FASTQ paired-end

Setup details for running ProteoSeeker through the command-line can be found at:

https://github.com/SkretasLab/ProteoSeeker/blob/main/README.md#34-command-line

Test Cases

All tests for the evaluation were run with ProteoSeeker version 1.0.0 (the “v1.0.0” release) and the tool versions described in it. The collection dates of the databases used in the evaluation are listed below. We also note the download date of the flat file of reviewed proteins from the Swiss-Prot/UniprotKB database, which was used to collect information about these proteins: their protein families, Pfam profiles, protein names and protein lengths.

Pfam database: 29/05/2024
Swiss-Prot/UniprotKB database: 29/05/2024
GTDB taxonomy taxdump files: 29/05/2024
Kraken 2/Bracken Refseq indexes: Collection Standard-8: 05/06/2024 (prior to the update)
Kraken 2/Bracken Refseq indexes: Collection Standard-16: 05/06/2024 (prior to the update)
Kraken 2/Bracken Refseq indexes: Collection Standard: 05/06/2024 (prior to the update)
Non-redundant (nr) database: 27/06/2024
Reviewed proteins – Swiss-Prot/UniprotKB flat file: 04/08/2023

Step by step instructions to run the test cases can be found at: https://github.com/SkretasLab/ProteoSeeker/blob/main/README.md#4-test-cases

Code Availability

The source code for ProteoSeeker can be found at the SkretasLab/ProteoSeeker GitHub repository (https://github.com/SkretasLab/ProteoSeeker). ProteoSeeker is also shipped as a Docker image available in Docker Hub at skretaslab/proteoseeker (https://hub.docker.com/repository/docker/skretaslab/proteoseeker/general).

Cite ProteoSeeker

If you find ProteoSeeker useful in your work, please cite: (manuscript under review).

Contacts and Bug Reports

Feel free to send questions or bug reports to proteoseeker@fleming.gr

License

ProteoSeeker is licensed under the GNU General Public License v3.0.
