Identifying viral OTUs from metagenomic assemblies

This is a bioinformatics pipeline for identifying non-redundant viral populations from metagenomic assemblies, leveraging VirSorter2, CheckV, and geNomad. Predicted viral sequences are filtered, aggregated across samples, and dereplicated at 95% ANI over 85% aligned fraction to generate a set of viral operational taxonomic units (vOTUs) for downstream analyses.

1. Overview

This pipeline performs the following steps:

VirSorter2+CheckV
- a. Viral prediction (VirSorter2 with the --keep-original-seq option)
- b. Quality assessment and host-region trimming (CheckV)
- c. Filtering to retain viral sequences ≥ 5 kbp with at least as many viral genes as host genes
geNomad
- a. Viral prediction (geNomad with the --enable-score-calibration option)
- b. Filtering to retain viral sequences ≥ 5 kbp with FDR < 1%
Aggregation: Combines viral sequences across samples.
Dereplication: Clusters viral sequences at 95% ANI over 85% aligned fraction.
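
The two filtering rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: it assumes the CheckV quality summary provides the columns contig_id, contig_length, viral_genes, and host_genes, and that the geNomad virus summary provides seq_name, length, and fdr. The function names and the constants MIN_LEN and MAX_FDR are hypothetical, and the sketch ignores provirus trimming.

```python
import csv
import io

MIN_LEN = 5000   # minimum viral sequence length in bp (the 5 kbp cutoff)
MAX_FDR = 0.01   # geNomad false discovery rate cutoff (1%)


def passes_checkv_filter(row, min_len=MIN_LEN):
    """Keep a CheckV row if it is long enough and viral genes >= host genes."""
    return (
        int(row["contig_length"]) >= min_len
        and int(row["viral_genes"]) >= int(row["host_genes"])
    )


def passes_genomad_filter(row, min_len=MIN_LEN, max_fdr=MAX_FDR):
    """Keep a geNomad row if it is long enough and its FDR is below the cutoff."""
    fdr = row["fdr"]
    if fdr in ("", "NA"):
        # Assumption: a missing FDR (no score calibration) fails the filter.
        return False
    return int(row["length"]) >= min_len and float(fdr) < max_fdr


def filter_table(tsv_text, id_column, keep):
    """Return the IDs of rows in a tab-separated table that satisfy `keep`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[id_column] for row in reader if keep(row)]


# Toy tables standing in for real tool outputs (all values are made up).
checkv_tsv = (
    "contig_id\tcontig_length\tviral_genes\thost_genes\n"
    "s1_c1\t12000\t8\t2\n"
    "s1_c2\t4000\t5\t0\n"
    "s1_c3\t9000\t1\t6\n"
)
genomad_tsv = (
    "seq_name\tlength\tfdr\n"
    "s1_c1\t12000\t0.002\n"
    "s1_c4\t7000\t0.05\n"
)

checkv_kept = filter_table(checkv_tsv, "contig_id", passes_checkv_filter)
genomad_kept = filter_table(genomad_tsv, "seq_name", passes_genomad_filter)
```

In the toy data, s1_c2 fails the length cutoff, s1_c3 has more host genes than viral genes, and s1_c4 exceeds the FDR cutoff, so only s1_c1 is kept by each filter.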

2. Directory structure

viral-identification/
├── README.md
├── vir-id.smk
├── config.yaml
├── samples.tsv  # table of sample IDs and metagenomic assembly paths
└── results/     # outputs (not included)
data/
└── contigs/     # metagenomic assemblies (not included)
resources/
└── genomad_db/  # geNomad database (not included)

3. Dependencies

This pipeline is built using Snakemake. All software dependencies are managed via conda.

Snakemake (tested with v5.26.0)
VirSorter2 (tested with v2.2.4)
CheckV (tested with v1.0.1)
geNomad (tested with v1.8.1)
CD-HIT (tested with v4.8.1)
Python (tested with v3.10.0)
pandas (tested with v2.2.0)
Biopython (tested with v1.83)

4. System requirements

This pipeline is designed for high-performance computing (HPC) environments.

Hardware requirements

Memory: Minimum 64 GB RAM
CPU: Scalable from 16 to 48+ cores

5. Installation & Setup

5.1 Clone the repository

git clone https://github.com/CSB5/SPMP_Phages.git

5.2 Install dependencies

conda create -n vir-id -c conda-forge -c bioconda snakemake=5.26.0 virsorter=2.2.4 checkv=1.0.1 genomad=1.8.1 cd-hit=4.8.1 pandas=2.2.0 biopython=1.83
conda activate vir-id

5.3 Download databases

Download the reference database for each tool, following that tool's own documentation:

VirSorter2
CheckV
geNomad: The geNomad database path should be specified in config.yaml.

6. Usage

The pipeline requires a set of metagenomic assembly files as input. List each assembly in samples.tsv, setting sample_id, contigs_path, and contig_header_prefix for every sample.
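
As a rough sketch of what a valid samples.tsv might look like and how it could be checked before a run, the snippet below parses a toy table. It assumes the file is tab-separated with a header row naming the three fields above; the read_samples helper, the sample IDs, and the file paths are hypothetical and are not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "contigs_path", "contig_header_prefix")


def read_samples(handle):
    """Parse a samples table into {sample_id: row}, raising on malformed input."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samples table is missing columns: {missing}")
    samples = {}
    for row in reader:
        sample_id = row["sample_id"]
        if sample_id in samples:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        samples[sample_id] = row
    return samples


# Toy table; the IDs, paths, and prefixes are placeholders for illustration.
example = io.StringIO(
    "sample_id\tcontigs_path\tcontig_header_prefix\n"
    "S001\tdata/contigs/S001.fasta\tS001_\n"
    "S002\tdata/contigs/S002.fasta\tS002_\n"
)
samples = read_samples(example)
```

Failing early on a missing column or a duplicated sample ID is cheaper than discovering the problem partway through a long run.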

The following parameters are defined in config.yaml:

Minimum viral sequence length (min_len, default: 5000)
VirSorter2 minimum score (min_vs2_score, default: 0.5)
geNomad maximum FDR (max_gmd_fdr, default: 0.01)
geNomad database path (genomad_db)

Navigate to the project subdirectory and run the pipeline, setting --cores to the number of CPU cores available:

cd pipelines/viral-identification
snakemake -s vir-id.smk --cores 48

7. Outputs

All results are saved in the results/ directory. The primary outputs are:

viral.fna: Viral sequences identified by VirSorter2+CheckV and geNomad, aggregated across all samples, and dereplicated at 100% ANI.
v95.fna: Non-redundant vOTUs at 95% ANI over 85% aligned fraction.

These outputs serve as inputs for downstream viral clustering, host association, and other analyses.
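
To make the vOTU threshold concrete, the sketch below applies the 95% ANI over 85% aligned fraction rule in a greedy, length-sorted clustering, which is the general strategy CD-HIT follows (longest sequence first, each sequence joining the first representative it matches). It is an illustration only, not the pipeline's implementation: the pairwise ANI and aligned-fraction values are supplied by hand, the aligned fraction is assumed to be measured on the shorter sequence, and the function names are hypothetical.

```python
def same_votu(ani, aligned_fraction, min_ani=0.95, min_af=0.85):
    """Return True if a pair meets both the ANI and aligned-fraction cutoffs."""
    return ani >= min_ani and aligned_fraction >= min_af


def greedy_cluster(lengths, pairs, min_ani=0.95, min_af=0.85):
    """Cluster sequences greedily, longest first.

    lengths: {seq_id: length in bp}
    pairs:   {(shorter_id, longer_id): (ani, aligned_fraction)}
    Returns  {representative_id: [member_ids]}.
    """
    clusters = {}
    # Longest sequences are visited first, so they become representatives.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            # Pairs without a recorded comparison are treated as unrelated.
            ani, af = pairs.get((seq_id, rep), (0.0, 0.0))
            if same_votu(ani, af, min_ani, min_af):
                clusters[rep].append(seq_id)
                break
        else:
            # No representative matched: start a new vOTU.
            clusters[seq_id] = [seq_id]
    return clusters


# Toy input: four sequences with hand-written pairwise values.
lengths = {"a": 40000, "b": 38000, "c": 12000, "d": 9000}
pairs = {
    ("b", "a"): (0.97, 0.90),  # passes both cutoffs
    ("c", "a"): (0.99, 0.40),  # high ANI but too little aligned
    ("d", "c"): (0.96, 0.88),  # passes both cutoffs
}
votus = greedy_cluster(lengths, pairs)
```

Here b joins the vOTU represented by a, while c is highly similar to a over only 40% of its length and therefore founds its own vOTU, which d then joins.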

8. Citation

If you use this pipeline in your research, please cite:

Chen, H. et al. GuFi phages represent the most prevalent viral family-level clusters in the human gut microbiome. bioRxiv. https://doi.org/10.64898/2026.01.26.701711
VirSorter2: Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
CheckV: Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology 39, 578–585 (2021).
geNomad: Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nature Biotechnology 42, 1303–1312 (2024).
CD-HIT: Fu, L. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
