Tutorial 2 - taxonomic profiling with GTDB

In this tutorial, we will use sylph to do

Multi-sample prokaryotic metagenomic profiling with the GTDB-R214 database
output a taxonomic output like MetaPhlAn or Kraken2.

1. Download the GTDB-R214 database.

Note

This tutorial uses the GTDB-R214 database, an older database. A new GTDB version (R220) has 35% more genomes. Feel free to swap out GTDB-r214 for another GTDB version.

Option 1. Download pre-sketched database

Download the pre-sketched (i.e., indexed) GTDB-R214 database provided here. For example,

wget https://storage.googleapis.com/sylph-stuff/v0.3-c1000-gtdb-r214.syldb -O gtdb_database.syldb

# OR

#wget https://storage.googleapis.com/sylph-stuff/v0.3-c200-gtdb-r214.syldb -O gtdb_database.syldb

The c1000 database is smaller and faster, but less sensitive for low-abundance genomes compared to the c200 database.

Option 2. Create your own index (if you already have GTDB downloaded)

Assuming you have the GTDB-R214 genome database present in a folder called gtdb_genomes_reps_r214/, do the following:

Compile all GTDB genomes into a list file: find gtdb_genomes_reps_r214 | grep .fna > gtdb_all.txt
Create a database with: sylph sketch -l gtdb_all.txt -t 50 -o gtdb_database

2. Download and index mouse-gut metagenome

Let's download and index two mouse gut metagenomes:

wget https://storage.googleapis.com/sylph-stuff/mouse_1.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_1.fq.gz

wget https://storage.googleapis.com/sylph-stuff/mouse_2.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_2.fq.gz

We now have two sets of paired-reads. For paired-end reads, sylph must sketch the reads first as follows:

sylph sketch -1 *mouse_1.fq.gz -2 *mouse_2.fq.gz

sylph can sketch multiple samples at once. This allows for multi-threaded sketching and is more efficient.

This outputs two files called mouse_1.fq.gz.paired.sylsp and B-mouse_1.fq.gz.paired.sylsp. The paired indicates the sketch was created with the -1,-2 options, and the sylsp indicates that it is indexed.

Warning

Do not interleave paired-end reads. Sylph will give more accurate results if you do not concatenate/interleave your reads.

3. Metagenomic profiling with sylph

To multi-sample profile with sylph, run the following:

sylph profile gtdb_database.syldb *mouse_1.fq.gz.paired.sylsp  -t 10 -o results.tsv

This uses 10 threads to profile the metagenome against our database into a file called results.tsv.

Tip

Since sylph v0.6, direct profiling of paired-end reads without sketching is possible:

sylph profile gtdb_database.syldb -1 *_1.fq.gz -2 *_2.fq.gz -t 10 -o results.tsv

Inspecting the results.tsv file gives:

head results.tsv 
Sample_file Genome_file Taxonomic_abundance Sequence_abundance  Adjusted_ANI    Eff_cov ANI_5-95_percentile Eff_lambda  Lambda_5-95_percentile  Median_cov  Mean_cov_geq1   Containment_ind Naive_ANI   Contig_name
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/910/577/315/GCA_910577315.1_genomic.fna.gz  10.6258 10.1684 97.66   0.656   97.33-98.07 0.656   0.55-0.76   1   1.456   616/2624    95.43   CAJTME010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC104416 genome assembly, contig: MGBC104416.1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCF/001/945/605/GCF_001945605.1_genomic.fna.gz  9.1588  7.5114  98.36   0.565   97.92-98.84 0.565   0.47-0.66   1   1.421   553/2100    95.79   NZ_MPKA01000006.1 Dubosiella newyorkensis strain NYU-BL-A4 NODE_100_length_1023_cov_1250.21_ID_199, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/910/589/675/GCA_910589675.1_genomic.fna.gz  8.3282  5.1014  95.18   0.514   94.47-95.99 0.514   0.37-0.67   1   1.299   144/1629    92.47   CAJUTO010000001.1 uncultured Lactobacillus sp. isolate MGBC166701 genome assembly, contig: MGBC166701.1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/910/588/855/GCA_910588855.1_genomic.fna.gz  8.1294  11.9822 98.20   0.502   97.87-98.59 0.502   0.43-0.56   1   1.345   892/3903    95.35   CAJUSR010000001.1 TPA_asm: uncultured Kineothrix sp. isolate MGBC162921 genome assembly, contig: MGBC162921.1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/910/579/675/GCA_910579675.1_genomic.fna.gz  7.7809  7.5080  98.71   0.480   98.27-99.11 0.480   0.40-0.55   1   1.303   654/2522    95.74   CAJTUF010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC114255 genome assembly, contig: MGBC114255.1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCF/001/591/705/GCF_001591705.1_genomic.fna.gz  7.1538  6.0999  96.41   0.441   95.86-97.20 0.441   0.32-0.54   1   1.216   264/2258    93.31   NZ_BCVK01000001.1 Lactococcus lactis subsp. cremoris NBRC 100676, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCF/000/403/395/GCF_000403395.2_genomic.fna.gz  6.9472  9.1490  95.75   0.429   95.27-96.45 0.429   0.32-0.51   1   1.251   299/3240    92.60   NZ_KE159657.1 Anaerotruncus sp. G3(2012) strain G3 acPFl-supercont1.1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/003/833/075/GCA_003833075.1_genomic.fna.gz  5.7036  7.2196  97.02   0.352   96.46-97.75 0.352   0.27-0.42   1   1.234   394/3324    93.35   RIAY01000001.1 Muribaculaceae bacterium Isolate-036 (Harlan) seq1, whole genome shotgun sequence
mouse_1.fq  gtdb_genomes_reps_r214/database/GCA/910/587/685/GCA_910587685.1_genomic.fna.gz  5.0394  3.2464  97.82   0.311   97.06-98.58 0.311   0.24-0.39   1   1.200   230/1672    93.80   CAJUMY010000001.1 TPA_asm: uncultured Christensenellaceae bacterium isolate MGBC161649 genome assembly, contig: MGBC161649.1, whole genome shotgun sequence

You can check the output format documentation for the definitions of each of the columns.

4. Get a taxonomic profile

Sylph's TSV output has no taxonomic information, only genome information. To get a taxonomic profile, we integrate GTDB's taxonomy with sylph's output. More information can be found here.

Run the following commands:

conda install -c bioconda sylph-tax

# only have to do this once
mkdir taxonomy_file_folder
sylph-tax download --download-to taxonomy_file_folder

sylph-tax taxprof results.tsv -t GTDB_r214 -o prefix_

ls prefix_mouse_1.fq.gz.sylphmpa
ls prefix_B-mouse_2.fq.gz.sylphmpa

Important

-t's metadata file must correspond to database used. See sylph-tax for available database. If you use GTDB-R220 or R214, you must use the correct R220 or R214 taxonomy.

The script outputs a new file called prefix_MYSAMPLENAME.sylphmpa for each sample in the results file. Investigating one file gives the following:

head -n 20 prefix_mouse_1.fq.sylphmpa                                                                                       
#SampleID   mouse_1.fq
clade_name  relative_abundance  sequence_abundance  ANI (if strain-level)
d__Bacteria 100.00010000000002  99.99999999999999   NA
d__Bacteria|p__Bacillota    24.640800000000002  18.712699999999998  NA
d__Bacteria|p__Bacillota_A  47.333499999999994  52.5969 NA
d__Bacteria|p__Bacillota_A|c__Clostridia    47.333499999999994  52.5969 NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales  9.0019  6.252800000000001   NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae 3.9625  3.0064  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter 3.9625  3.0064  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter 3.9625  3.0064  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter|t__GCA_910587595.1  3.9625  3.0064  95.45
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700   5.0394  3.2464  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649 5.0394  3.2464  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649   5.0394  3.2464  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649|t__GCA_910587685.1    5.0394  3.2464  97.82
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales  20.3153 24.426900000000003  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae   20.3153 24.426900000000003  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69   4.1703  4.1626  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69   4.1703  4.1626  NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69|t__GCA_910589105.1    4.1703  4.1626  97.47

This file is a taxonomic profile similar to what MetaPhlAn outputs. Each taxonomic rank has an associated taxonomic or sequence abundance.

Note

Taxonomic abundance normalizes by genome sizes, whereas sequence abundance is the % of reads assigned to a taxonomic rank. See Sun et al., 2021 for information.

Note that the t__GCA_... ranks, which are strain/genome-level ranks, have an associated ANI output. The other ranks do not.

5. Conclusion

We have just shown how sylph can do multi-sample profiling against the GTDB-R214 database very efficiently while giving ANI information as well.

To do taxonomic profiling with other databases, see the manual. Alternatively, consider just analyzing sylph's resulting TSV file; it contains some useful information not present in the taxonomic profile.