Read Classification
KnuttReads2Bins currently supports three toolchains for classifying the prepared reads. The first is a small chain that filters reads of the SSU rRNA gene and classifies them using SINA. The second uses Kaiju to search protein databases, such as NCBI's non-redundant protein set, for k-mers found in the reads. The third uses Sourmash, which relies on hashes of nucleotide k-mers.
When looking at Krona reports, make sure to check the fraction of the sample that actually was classified.
The steps for this stage are defined in the `Snakefile_2ClassifyReads` file. The disk space needed for the reference database depends on the selected Kaiju database; with a current nr_euk dataset, around Tbd. GB are required.
- Small metagenome: Tbd. GB
- Medium metagenome: Tbd. GB
- Large metagenome: Tbd. GB
Reference data can be generated with the `classifyReadsRefData` rule. All classification steps can be run with the `classifyReads`, `classifyReadsKrona` and `classifyReadsReport` rules. Outputs are stored in `<outputdir>/ReadClassification/`, `<outputdir>/Data/ReadClassification/` and `<outputdir>/Reports/`.
SSU reads (SINA)
The SSU workflow uses the configured version of the SILVA SSU NR 99 database to build a BBMap and SINA index for analysis. This can be executed with the `SSURefData` rule. The files use around 5.5 GB of disk space.

The filtering and classification of the reads can be run with the `SSU` rule. A report can be generated with the `SSUReport` rule.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       ├── sina_SAMPLE.tsv
│       └── ssu_bbmap_map_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       └── SAMPLE_readclassification_SSU.tsv
├── ReadClassification
│   └── SSU
│       └── SAMPLE
│           ├── SAMPLE_bbmap_SSU.bam
│           ├── SAMPLE_bbmap_SSU.fasta
│           ├── SAMPLE_bbmap_SSU.out
│           ├── SAMPLE_sina_hits.csv
│           ├── SAMPLE_sina_hits.fasta
│           └── SAMPLE_sina_hits.log
└── Reports
    ├── 5.1SSUreport.html
    └── SSU_krona.html
See the general file information for the `Benchmarking` files. The `SAMPLE_readclassification_SSU.tsv` file combines the results from SINA (`SAMPLE_sina_hits.csv`) and BBMap (`SAMPLE_bbmap_SSU.bam`). It contains the following columns; every row is a read retained by the BBMap filter.
- `qname`: Full ID of the read
- `sname`: Full ID of the BBMap hit
- `strand`: Strand of the BBMap hit (-/+)
- `pos`: Start (left-most) position of the alignment
- `qwidth`: Read length
- `mapq`: Mapping quality, the error rate as a PHRED score
- `cigar`: CIGAR alignment string
- `tag.MD`: BBMap MD tag
- `tag.XM`: Number of mismatches (BBMap)
- `tag.NM`: Edit distance (BBMap)
- `align_cutoff_head_slv`: Unaligned bases at the beginning of the read (SINA)
- `align_cutoff_tail_slv`: Unaligned bases at the end of the read (SINA)
- `align_filter_slv`: Positional variability by parsimony filter applied (default: NA)
- `align_quality_slv`: Highest similarity to a reference
- `aligned_slv`: Date and time of the alignment
- `lca_tax_embl`: LCA result of the EMBL-EBI/ENA taxa
- `lca_tax_gg`: LCA result of the Greengenes taxa
- `lca_tax_gtdb`: LCA result of the GTDB taxa (recommended)
- `lca_tax_rdp`: LCA result of the RDP taxa
- `lca_tax_slv`: LCA result of the SILVA taxa
- `nearest_slv`: References found in the homology search, space separated as ID.VERSION.START.END~IDENTITY
- `turn`: Did SINA reverse and/or complement the read?
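The space-separated `nearest_slv` entries can be split back into their components with a few lines of Python. This sketch is not part of the workflow; the example value is invented for illustration and only follows the ID.VERSION.START.END~IDENTITY pattern described above.

```python
def parse_nearest_slv(field):
    """Split a nearest_slv field into (id, version, start, end, identity) tuples."""
    hits = []
    for entry in field.split():
        ref, identity = entry.rsplit("~", 1)          # separate the identity value
        acc, version, start, end = ref.rsplit(".", 3)  # ID.VERSION.START.END
        hits.append((acc, int(version), int(start), int(end), float(identity)))
    return hits

# Hypothetical field value for illustration only:
example = "AF123456.1.1.1500~0.987 X98765.2.5.1402~0.951"
for acc, version, start, end, ident in parse_nearest_slv(example):
    print(acc, version, start, end, ident)
```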
Protein reads (Kaiju)
Kaiju can use different databases. Their usage and construction can be quite memory-hungry and time-intensive. You may want to change the thread count in the Snakefile. It is also possible to download the `.fmi` file from the Kaiju website and copy it to the appropriate location (you can see it when running the workflow with the `-rn` flags). Reference data generation is executed with the `kaijuRefData` rule.

The classification for all samples is run with the `kaiju` rule. A report can be generated with the `kaijuReport` rule. The Krona report (`kaijuKrona`) doesn't contain the database.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       └── kaiju_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       └── SAMPLE_readclassification_kaiju.tsv
├── ReadClassification
│   └── kaiju
│       └── SAMPLE
│           └── SAMPLE_kaiju.log
└── Reports
    ├── 5.2kaijureport.html
    └── kaiju_krona.html
See the general file information for the `Benchmarking` files. The `SAMPLE_readclassification_kaiju.tsv` file contains the results, with one row for every read.
- `qname`: Full ID of the read
- `classified`: Was this read classified (U or C)
- `taxid`: Assigned NCBI taxonomy ID
- `lengthorscore`: Length (MEM) or score (Greedy) of this match
- `match_taxids`: Taxonomy IDs of the sequences (max. 20) with the best match
- `match_ascs`: Accessions of those sequences
- `match_fragments`: Matching fragments
- `tax`: Classification taxonomy string, separated by ;
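Since the file contains a row for every read, the classified fraction mentioned for the Krona reports can be computed directly from the `classified` column. A minimal sketch, not part of the workflow; the TSV excerpt below is invented and only mirrors the columns listed above.

```python
import csv
import io

# Hypothetical excerpt of a SAMPLE_readclassification_kaiju.tsv file;
# values are invented for illustration.
tsv = (
    "qname\tclassified\ttaxid\ttax\n"
    "read1\tC\t562\tBacteria;Proteobacteria;Escherichia coli\n"
    "read2\tU\t0\t\n"
    "read3\tC\t2\tBacteria\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
classified = sum(1 for r in rows if r["classified"] == "C")
fraction = classified / len(rows)
print(f"{classified}/{len(rows)} reads classified")
```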
Polynucleotide hash sketches (Sourmash)
Sourmash is fast and memory-efficient. The default settings use a high k-mer length to provide more specific results than the Kaiju analysis. The taxonomy and the available genomes are based on the Genome Taxonomy Database (GTDB) project. The hash count is proportional to the read count. While this preserves the diversity of the reads, it makes samples of uneven depths even harder to compare.
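The proportionality of hash count to read count comes from the "scaled" MinHash approach: only k-mer hashes below a fixed threshold are kept, so the retained set grows with the number of distinct k-mers. The sketch below illustrates the idea only; it uses SHA-1 instead of Sourmash's actual hash function, and the `scaled` value is an invented placeholder.

```python
import hashlib

def scaled_hashes(kmers, scaled=4):
    """Keep only hashes below max_hash; on average 1/scaled of distinct k-mers survive."""
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers:
        # Illustrative hash only; Sourmash uses a different hash function.
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            kept.add(h)
    return kept

kmers = ["AAA", "AAT", "ATT", "TTT", "CCC", "CCG", "CGG", "GGG"]
print(len(scaled_hashes(kmers)), "of", len(kmers), "k-mer hashes retained")
```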
Download the reference database with the `sourmashRefData` rule, run all analysis steps with `sourmash` and produce the report with `sourmashReport`. The Sourmash Krona report (`sourmashKrona`) also doesn't contain the database.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       ├── sourmash_comparison.tsv
│       ├── sourmash_description_SAMPLE.tsv
│       ├── sourmash_gather_SAMPLE.tsv
│       ├── sourmash_k51_SAMPLE.tsv
│       └── sourmash_lca_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       ├── SAMPLE_readclassification_sourmash_gather.tsv
│       ├── SAMPLE_readclassification_sourmash_lca.tsv
│       ├── SAMPLE_sourmash_signature.tsv
│       ├── sourmash_sample_comparison
│       ├── sourmash_sample_comparison.bin
│       ├── sourmash_sample_comparison.bin.labels.txt
│       ├── sourmash_sample_comparison.dendro.png
│       ├── sourmash_sample_comparison.hist.png
│       ├── sourmash_sample_comparison.labels.txt
│       ├── sourmash_sample_comparison.matrix.png
│       └── sourmash_sample_comparison.tsv
├── ReadClassification
│   └── sourmash
│       ├── SAMPLE
│       │   ├── SAMPLE_sourmash_descr.log
│       │   ├── SAMPLE_sourmash_gather.log
│       │   ├── SAMPLE_sourmash_k51.log
│       │   ├── SAMPLE_sourmash_k51.sig
│       │   └── SAMPLE_sourmash_lca.log
│       └── sourmash_sample_comparison.log
└── Reports
    ├── 5.3sourmashReport.html
    └── sourmash_krona.html
See the general file information for the `Benchmarking` files. `SAMPLE_readclassification_sourmash_gather.tsv` lists the genomes from the database that were found in the sample.
- `intersect_bp`: Base pairs (approximate) shared between sample and genome
- `f_orig_query`: Fraction of the sample in the genome
- `f_match`: Fraction of the genome in the sample
- `f_unique_to_query`: Fraction of the sample that is only represented in this genome
- `f_unique_weighted`: `f_unique_to_query` weighted by hash abundance
- `average_abund`: Average abundance of the hashes found in the sample
- `median_abund`: Median abundance of the hashes found in the sample
- `std_abund`: Standard deviation of the abundance of the hashes found in the sample
- `name`: Genome name
- `filename`: Reference data file
- `md5`: MD5 checksum of the genome
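Because each `f_unique_to_query` value covers a disjoint part of the sample, summing the column gives a rough estimate of how much of the sample is represented by database genomes at all. A minimal sketch, not part of the workflow; the excerpt below is invented and only uses the columns described above.

```python
import csv
import io

# Hypothetical excerpt of SAMPLE_readclassification_sourmash_gather.tsv
# (only the columns used here; values invented for illustration).
tsv = (
    "name\tf_unique_to_query\tf_match\n"
    "Escherichia coli\t0.30\t0.85\n"
    "Bacteroides fragilis\t0.12\t0.40\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
# The unique fractions are disjoint, so their sum approximates the
# total fraction of the sample covered by database genomes.
covered = sum(float(r["f_unique_to_query"]) for r in rows)
print(f"Fraction of sample hashes assigned: {covered:.2f}")
```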
The file `SAMPLE_readclassification_sourmash_lca.tsv` uses the hashes to provide a taxonomic composition of the sample. While it looks like a Krona data file, the `count` field gives the number of hashes assigned to a node plus the summed counts of its children.
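The cumulative behaviour of such a `count` field can be sketched as follows. The toy taxonomy and hash counts are invented for illustration and are not taken from any real output file.

```python
# Each node reports its own assigned hashes plus the counts of all children.
tree = {
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": [],
    "Firmicutes": [],
}
assigned = {"Bacteria": 5, "Proteobacteria": 40, "Firmicutes": 15}

def cumulative_count(node):
    """count field: own assigned hashes plus summed child counts."""
    return assigned[node] + sum(cumulative_count(c) for c in tree[node])

print(cumulative_count("Bacteria"))  # 5 + 40 + 15 = 60
```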
`SAMPLE_sourmash_signature.tsv` provides information on the generated sketch/signature files.
`sourmash_comparison.tsv` contains the comparison of the samples based on the Sourmash hashes, using the Jaccard similarity.
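The Jaccard similarity underlying the comparison is simply the size of the intersection of two hash sets divided by the size of their union. A minimal sketch with invented hash sets:

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Invented hash sets standing in for two samples' Sourmash hashes:
sample_a = {1, 2, 3, 4}
sample_b = {3, 4, 5, 6}
print(jaccard(sample_a, sample_b))  # 2 shared of 6 total hashes
```

Identical samples score 1.0, samples sharing no hashes score 0.0, which is why samples of very uneven sequencing depth are hard to compare this way.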