Link

Read Classification

KnuttReads2Bins currently supports 3 toolchains to classify the prepared reads. The first is a small chain that filters reads of the SSU rRNA gene and classifies them using SINA. The second one uses Kaiju to search protein databases like NCBIs non-redundant protein set for k-mers found in the reads. The third method uses Sourmash which relies on hashes of nucleotide k-mers.

If looking at Krona reports, make sure to check the fraction of the sample, that actually was classified.

The steps for this stage are defined in the Snakefile_2ClassifyReads file. While the disk space needed for the reference database depends on the selected Kaiju database, with a current nr_euk dataset, around Tbd. GB are required.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

Reference data can be generated with the rule classifyReadsRefData. All classification steps can be run with the classifyReads, classifyReadsKrona and classifyReadsReport rules.

Outputs are stored in <outputdir>/ReadClassification/, <outputdir>/Data/ReadClassification/ and <outputdir>/Reports/.


Table of contents

  1. SSU reads (SINA)
  2. Protein reads (Kaiju)
  3. Polynucleotide hash sketches (Sourmash)

SSU reads (SINA)

The SSU workflow uses the configured version of the SILVA SSSU NR 99 database to build a BBMap and SINA index for analysis. This can be executed with the SSURefData rule. The files use around 5.5GB disk space.

The filtering and classification of the reads can be run with the SSU rule. A report can be generated with the SSUReportrule.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       ├── output/Benchmarking/ReadClassification/sina_SAMPLE.tsv
│       └── output/Benchmarking/ReadClassification/ssu_bbmap_map_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       └── output/Data/ReadClassification/SAMPLE_readclassification_SSU.tsv
├── output/ReadClassification
│   └── output/ReadClassification/SSU
│       └── output/ReadClassification/SSU/SAMPLE
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.bam
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.fasta
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.out
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.csv
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.fasta
│           └── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.log
└── output/Reports
    ├── output/Reports/5.1SSUreport.html
    └── output/Reports/SSU_krona.html

See the general file information for the Benchmarking files. The SAMPLE_readclassification_SSU.tsv file combines the results from SINA (SAMPLE_sina_hits.csv) and BBmap (SAMPLE_bbmap_SSU.bam).

It contains the following columns. Every row is a read filtered by BBmap.

qname
Full id of the read
sname
Full id of the BBMap hit
strand
Strand of the BBmap hit (-/+)
pos
Start (left-most) part of the alignment
qwidth
Read length
mapq
Mapping quality, error rate as a PHRED score
cigar
CIGAR alignment string
tag.MD
BBmap MD tag
tag.XM
Number of mismatches (BBMap)
tag.NM
Edit distance (BBMap)
align_cutoff_head_slv
Unaligned bases at the beginning of the read (SINA)
align_cutoff_tail_slv
Unaligned bases at the end of the read (SINA)
align_filter_slv
Positional variability by parsimony filter applied (Default: NA)
align_quality_slv
Highest similarity to a reference
aligned_slv
Date and time of the alignment
lca_tax_embl
LCA result of the EMBL-EBI/ENA taxa
lca_tax_gg
LCA result of the Greengenes taxa
lca_tax_gtdb
LCA result of the GTDB taxa (recommended)
lca_tax_rdp
LCA result of the RDP taxa
lca_tax_slv
LCA result of the SILVA taxa
nearest_slv
References found in the homology search. Space seperated:
ID.VERSION.START.END~IDENTITY
turn
Did SINA reverse and/or complement the read?

Protein reads (Kaiju)

Kaiju can use different databases. Their usage and construction can be quite memory hungry and time intensive. You may want to change the thread count in the Snakefile. It is also possible to download the .fmi file from the Kaiju website and copy it to the appropiate location (You can see it when running the workflow with the -rn flags). Reference data generation is executed with the kaijuRefData rule.

The classification for all samples is run with the kaiju rule. A report can be generated with the kaijuReportrule. The Krona report (kaijuKrona) doesn’t contains the database.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       └── output/Benchmarking/ReadClassification/kaiju_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       └── output/Data/ReadClassification/SAMPLE_readclassification_kaiju.tsv
├── output/ReadClassification
│   └── output/ReadClassification/kaiju
│       └── output/ReadClassification/kaiju/SAMPLE
│           └── output/ReadClassification/kaiju/SAMPLE/SAMPLE_kaiju.log
└── output/Reports
    │── output/Reports/5.2kaijureport.html
    └── output/Reports/kaiju_krona.html

See the general file information for the Benchmarking. The SAMPLE_readclassification_kaiju.tsvcontains the results. It contains rows for all reads.

qname
Full id of the read
classified
Was this read classified (U or C)
taxid
Assigned NCBI tax id
lengthorscore
Length (MEM) or score (Greedy) of this match
match_taxids
Tax ids of sequences (max 20) with the best match
match_ascs
Accessions of those sequences
match_fragments
Matching fragments
tax
Classification tax string, seperated by ;

Polynucleotide hash sketches (Sourmash)

Sourmash is really fast and memory efficient. The default settings use an high k-mer length to provide more specific results compared to the Kaiju analysis. The taxonomy and the available genoems are based on the genome taxanomy database (GTDB) project. The hash count is proportionally to the read count. While this preserves the diversity of the reads, it makes comparing samples of uneven depths even harder to compare.

Download the reference database with sourmashRefData, run all analysis steps with sourmash and produce the report with sourmashReport. The Sourmash Krona report also doesn’t contain the database (sourmashKrona).

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       ├── output/Benchmarking/ReadClassification/sourmash_comparison.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_description_SAMPLE.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_gather_SAMPLE.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_k51_SAMPLE.tsv
│       └── output/Benchmarking/ReadClassification/sourmash_lca_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       ├── output/Data/ReadClassification/SAMPLE_readclassification_sourmash_gather.tsv
│       ├── output/Data/ReadClassification/SAMPLE_readclassification_sourmash_lca.tsv
│       ├── output/Data/ReadClassification/SAMPLE_sourmash_signature.tsv
│       ├── output/Data/ReadClassification/PFL9_meta_sourmash_signature.tsv
│       ├── output/Data/ReadClassification/sourmash_sample_comparison
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.bin
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.bin.labels.txt
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.dendro.png
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.hist.png
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.labels.txt
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.matrix.png
│       └── output/Data/ReadClassification/sourmash_sample_comparison.tsv
├── output/ReadClassification
│   └── output/ReadClassification/sourmash
│       ├── output/ReadClassification/sourmash/SAMPLE
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_descr.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_gather.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_k51.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_k51.sig
│       │   └── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_lca.log
│       └── output/ReadClassification/sourmash/sourmash_sample_comparison.log
└── output/Reports
    ├── output/Reports/5.3sourmashReport.html
    └── output/Reports/sourmash_krona.html

See the general file information for the Benchmarking. SAMPLE_readclassification_sourmash_gather.tsv gives a list of genomes in the database found in the sample.

intersect_bp
Base pairs (approximate) shared between sample and genome
f_orig_query
Fraction of the sample in the genome
f_match
Fraction of the genome in the sample
f_unique_to_query
Fraction of the sample that is only represented in this genome
f_unique_weighted
f_unique_to_query weighted by hash abundance
average_abund
Average abundace of the hashes found in the sample
median_abund
Median abundace of the hashes found in the sample
std_abund
Standard deviation abundace of the hashes found in the sample
name
Genome name
filename
Reference data file
md5
MD5 checksum of the genome

The file SAMPLE_readclassification_sourmash_lca.tsv uses the hashes to provide a taxonomic composition of the sample. While it looks like a Krona data file, the count field gives the number of assigned hashes and summed child counts.

SAMPLE_sourmash_signature.tsv provides information on the generated sketch/signature files.

intersect_bp
Base pairs (approximate) shared between sample and genome
f_orig_query
Fraction of the sample in the genome
f_match
Fraction of the genome in the sample
f_unique_to_query
Fraction of the sample that is only represented in this genome
f_unique_weighted
f_unique_to_query weighted by hash abundance
average_abund
Average abundace of the hashes found in the sample
median_abund
Median abundace of the hashes found in the sample
std_abund
Standard deviation abundace of the hashes found in the sample
name
Genome name
filename
Reference data file
md5
MD5 checksum of the genome

sourmash_comparison.tsv contains the comparison of the samples based on the sourmash hashes using the Jaccard similarity.