.sylphmpa output format
.sylphmpa taxonomic profiling output format
*.sylphmpa files look like this:
#SampleID /home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz Taxonomies_used:['GTDB_r220']
clade_name relative_abundance sequence_abundance ANI (if strain-level) Coverage (if strain-level)
d__Bacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales 35.6384 36.0603 NA NA
....
Tip
This is a valid TSV file, but rows prefixed with # are comments.
You can read .sylphmpa files with pandas in Python, e.g. pd.read_csv('output.sylphmpa', sep='\t', comment='#').
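If pandas is not available, the same comment-skipping read can be approximated with Python's standard csv module. The sketch below is only illustrative: the SAMPLE text is a hypothetical, shortened stand-in for a real file, and it assumes the columns are tab-separated with the header names shown in the example above.

```python
import csv
import io

# Hypothetical miniature .sylphmpa content modeled on the example above.
# The sample path and the values are made up for illustration.
SAMPLE = (
    "#SampleID\tsample.fastq.gz\tTaxonomies_used:['GTDB_r220']\n"
    "clade_name\trelative_abundance\tsequence_abundance\t"
    "ANI (if strain-level)\tCoverage (if strain-level)\n"
    "d__Bacteria\t100.0\t100.0\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota\t100.0\t100.0\tNA\tNA\n"
)


def read_sylphmpa(handle):
    """Return the data rows as dicts, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_sylphmpa(io.StringIO(SAMPLE))
for row in rows:
    # All values arrive as strings; convert the abundances explicitly.
    print(row["clade_name"], float(row["relative_abundance"]))
```

For a real file, an open file handle (for example from open('output.sylphmpa', newline='')) can be passed in place of the io.StringIO object.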
Tip
You can visualize this file in Pavian if you use the sylph-tax taxprof .... --pavian option (since v1.5.1).
There are five important columns:
clade_name: A string like d__Bacteria|p__Actinomycetota|c__Acidimicrobiia|o__Acidimicrobiales|f__Ilumatobacteraceae that describes the clade. t__STRAIN represents the exact genome identifier.
relative_abundance: The taxonomic relative abundance of the clade.
sequence_abundance: The sequence abundance of the clade, i.e. the percentage of reads assigned to it.
ANI: sylph's ANI estimate at the strain level (t__ rows); NA at every other level.
Coverage: The Eff_cov or True_cov column of sylph's output.
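Because clade_name packs the whole lineage into one pipe-separated string, it is often useful to pull out the deepest rank of each row, for example to keep only order-level or strain-level entries. The helper names below (deepest_rank, clades_at_rank) are hypothetical and not part of sylph or sylph-tax; this is a minimal sketch that assumes each level has the form prefix__name and that empty levels may appear between pipes.

```python
def deepest_rank(clade_name):
    """Return (prefix, name) for the last non-empty level of a clade_name."""
    levels = [level for level in clade_name.split("|") if level]
    prefix, _, name = levels[-1].partition("__")
    return prefix, name


def clades_at_rank(clade_names, prefix):
    """Keep only the clade names whose deepest level uses the given prefix."""
    return [c for c in clade_names if deepest_rank(c)[0] == prefix]


# Clade names taken from the examples in this document.
clades = [
    "d__Bacteria",
    "d__Bacteria|p__Pseudomonadota",
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria",
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales",
]

order_level = clades_at_rank(clades, "o")
print(deepest_rank(order_level[0]))
```

Filtering on the deepest rank avoids double counting, since every row's abundance is already included in the rows of its parent clades.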
Tip
Viral-host information may be available for pre-built virus databases. The -a option adds a new column to the .sylphmpa files that associates viral genomes with their hosts. For example:
r__Duplodnaviria|k__Heunggongvirae|p__Uroviricota|c__Caudoviricetes|||||t__IMGVR_UViG_2503982007_000001 ... d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis
indicates that the host of IMGVR_UViG_2503982007_000001 is Staphylococcus epidermidis.
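The host lineage in the example uses semicolons rather than pipes, so it needs its own split. The helper below (host_species, a hypothetical name) is a small sketch that assumes the host value follows the rank__name;rank__name pattern shown above; it does not assume any particular name for the added column.

```python
def host_species(lineage):
    """Return the species name from a semicolon-separated host lineage, or None."""
    for level in lineage.split(";"):
        if level.startswith("s__"):
            return level[len("s__"):]
    return None


# Host lineage copied from the example above.
host = (
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;"
    "f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis"
)
print(host_species(host))
```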