Read Classification
KnuttReads2Bins currently supports three toolchains to classify the prepared reads. The first is a small chain that filters reads of the SSU rRNA gene and classifies them using SINA. The second uses Kaiju to search protein databases, such as NCBI's non-redundant protein set, for matches to the translated reads. The third method uses Sourmash, which relies on hashes of nucleotide k-mers.
When looking at the Krona reports, make sure to check the fraction of the sample that was actually classified.
The steps for this stage are defined in the Snakefile_2ClassifyReads file. The disk space needed for the reference data depends on the selected Kaiju database; with a current nr_euk dataset, around Tbd. GB are required.
- Small metagenome
- Tbd. GB
- Medium metagenome
- Tbd. GB
- Large metagenome
- Tbd. GB
Reference data can be generated with the rule classifyReadsRefData. All classification steps can be run with the classifyReads, classifyReadsKrona and classifyReadsReport rules.
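For example, from the workflow directory the rules can be invoked through snakemake, here wrapped in Python's subprocess module (a minimal sketch; the core count is a placeholder and additional options such as --use-conda depend on your setup):

```python
# Minimal sketch: build the reference data, then run all read classification
# targets. Assumes snakemake is installed and the workflow directory is the
# current working directory; the core count is only a placeholder.
import subprocess

def run_rules(*targets: str, cores: int = 16) -> None:
    subprocess.run(["snakemake", "--cores", str(cores), *targets], check=True)

run_rules("classifyReadsRefData")
run_rules("classifyReads", "classifyReadsKrona", "classifyReadsReport")
```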
Outputs are stored in <outputdir>/ReadClassification/, <outputdir>/Data/ReadClassification/ and <outputdir>/Reports/.
SSU reads (SINA)
The SSU workflow uses the configured version of the SILVA SSU NR 99 database to build a BBMap and SINA index for the analysis. This can be executed with the SSURefData rule. The files use around 5.5 GB of disk space.
The filtering and classification of the reads can be run with the SSU rule. A report can be generated with the SSUReport rule.
Output files
output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       ├── output/Benchmarking/ReadClassification/sina_SAMPLE.tsv
│       └── output/Benchmarking/ReadClassification/ssu_bbmap_map_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       └── output/Data/ReadClassification/SAMPLE_readclassification_SSU.tsv
├── output/ReadClassification
│   └── output/ReadClassification/SSU
│       └── output/ReadClassification/SSU/SAMPLE
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.bam
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.fasta
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_bbmap_SSU.out
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.csv
│           ├── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.fasta
│           └── output/ReadClassification/SSU/SAMPLE/SAMPLE_sina_hits.log
└── output/Reports
    ├── output/Reports/5.1SSUreport.html
    └── output/Reports/SSU_krona.html
See the general file information for the Benchmarking files. The SAMPLE_readclassification_SSU.tsv file combines the results from SINA (SAMPLE_sina_hits.csv) and BBMap (SAMPLE_bbmap_SSU.bam).
It contains the following columns; every row is a read that passed the BBMap filter. A small parsing sketch follows the column list.
- qname
- Full id of the read
- sname
- Full id of the BBMap hit
- strand
- Strand of the BBMap hit (-/+)
- pos
- Start (left-most) position of the alignment
- qwidth
- Read length
- mapq
- Mapping quality, the error rate expressed as a Phred score
- cigar
- CIGAR alignment string
- tag.MD
- BBMap MD tag
- tag.XM
- Number of mismatches (BBMap)
- tag.NM
- Edit distance (BBMap)
- align_cutoff_head_slv
- Unaligned bases at the beginning of the read (SINA)
- align_cutoff_tail_slv
- Unaligned bases at the end of the read (SINA)
- align_filter_slv
- Positional variability by parsimony filter applied (Default: NA)
- align_quality_slv
- Highest similarity to a reference
- aligned_slv
- Date and time of the alignment
- lca_tax_embl
- LCA result of the EMBL-EBI/ENA taxa
- lca_tax_gg
- LCA result of the Greengenes taxa
- lca_tax_gtdb
- LCA result of the GTDB taxa (recommended)
- lca_tax_rdp
- LCA result of the RDP taxa
- lca_tax_slv
- LCA result of the SILVA taxa
- nearest_slv
- References found in the homology search. Space-separated:
 ID.VERSION.START.END~IDENTITY
- turn
- Did SINA reverse and/or complement the read?
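As a quick sanity check, the table can be summarized by its GTDB LCA column. The sketch below is only an illustration: pandas, the mapq cutoff of 20 and the literal SAMPLE path are assumptions, not part of the workflow.

```python
# Sketch: count SSU reads per GTDB LCA for one sample.
# Assumes a tab-separated file with a header row holding the column names above.
import pandas as pd

ssu = pd.read_csv(
    "output/Data/ReadClassification/SAMPLE_readclassification_SSU.tsv", sep="\t"
)
ssu = ssu[ssu["mapq"] >= 20]  # optional: drop hits with low mapping quality
print(ssu["lca_tax_gtdb"].value_counts().head(20))
```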
Protein reads (Kaiju)
Kaiju can use different databases. Their use and construction can be quite memory-hungry and time-intensive. You may want to change the thread count in the Snakefile. It is also possible to download the .fmi file from the Kaiju website and copy it to the appropriate location (you can see the expected path when running the workflow with the -rn flags). Reference data generation is executed with the kaijuRefData rule.
The classification for all samples is run with the kaiju rule. A report can be generated with the kaijuReport rule. The Krona report (kaijuKrona) doesn't contain the database.
Output files
output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       └── output/Benchmarking/ReadClassification/kaiju_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       └── output/Data/ReadClassification/SAMPLE_readclassification_kaiju.tsv
├── output/ReadClassification
│   └── output/ReadClassification/kaiju
│       └── output/ReadClassification/kaiju/SAMPLE
│           └── output/ReadClassification/kaiju/SAMPLE/SAMPLE_kaiju.log
└── output/Reports
    ├── output/Reports/5.2kaijureport.html
    └── output/Reports/kaiju_krona.html
See the general file information for the Benchmarking files. The SAMPLE_readclassification_kaiju.tsv file contains the results, with one row per read; a small parsing sketch follows the column list.
- qname
- Full id of the read
- classified
- Was this read classified (U or C)
- taxid
- Assigned NCBI tax id
- lengthorscore
- Length (MEM) or score (Greedy) of this match
- match_taxids
- Tax ids of sequences (max 20) with the best match
- match_ascs
- Accessions of those sequences
- match_fragments
- Matching fragments
- tax
- Classification tax string, separated by ;
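A small sketch for inspecting this table (pandas and the literal SAMPLE path are assumptions, not part of the workflow):

```python
# Sketch: fraction of classified reads and the most common tax strings.
# Assumes a tab-separated file with a header row holding the column names above.
import pandas as pd

kaiju = pd.read_csv(
    "output/Data/ReadClassification/SAMPLE_readclassification_kaiju.tsv", sep="\t"
)
classified = kaiju["classified"].eq("C")
print(f"classified reads: {classified.mean():.1%}")
print(kaiju.loc[classified, "tax"].value_counts().head(10))
```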
Polynucleotide hash sketches (Sourmash)
Sourmash is fast and memory-efficient. The default settings use a high k-mer length to provide more specific results than the Kaiju analysis. The taxonomy and the available genomes are based on the Genome Taxonomy Database (GTDB) project. The hash count is proportional to the read count. While this preserves the diversity of the reads, it makes samples of uneven sequencing depth harder to compare.
Download the reference database with sourmashRefData, run all analysis steps with sourmash and produce the report with sourmashReport. The Sourmash Krona report also doesn’t contain the database (sourmashKrona).
Output files
output
├── output/Benchmarking
│   └── output/Benchmarking/ReadClassification
│       ├── output/Benchmarking/ReadClassification/sourmash_comparison.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_description_SAMPLE.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_gather_SAMPLE.tsv
│       ├── output/Benchmarking/ReadClassification/sourmash_k51_SAMPLE.tsv
│       └── output/Benchmarking/ReadClassification/sourmash_lca_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadClassification
│       ├── output/Data/ReadClassification/SAMPLE_readclassification_sourmash_gather.tsv
│       ├── output/Data/ReadClassification/SAMPLE_readclassification_sourmash_lca.tsv
│       ├── output/Data/ReadClassification/SAMPLE_sourmash_signature.tsv
│       ├── output/Data/ReadClassification/sourmash_sample_comparison
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.bin
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.bin.labels.txt
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.dendro.png
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.hist.png
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.labels.txt
│       ├── output/Data/ReadClassification/sourmash_sample_comparison.matrix.png
│       └── output/Data/ReadClassification/sourmash_sample_comparison.tsv
├── output/ReadClassification
│   └── output/ReadClassification/sourmash
│       ├── output/ReadClassification/sourmash/SAMPLE
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_descr.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_gather.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_k51.log
│       │   ├── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_k51.sig
│       │   └── output/ReadClassification/sourmash/SAMPLE/SAMPLE_sourmash_lca.log
│       └── output/ReadClassification/sourmash/sourmash_sample_comparison.log
└── output/Reports
    ├── output/Reports/5.3sourmashReport.html
    └── output/Reports/sourmash_krona.html
See the general file information for the Benchmarking files. SAMPLE_readclassification_sourmash_gather.tsv lists the genomes from the database that were found in the sample; a small parsing sketch follows the column list.
- intersect_bp
- Base pairs (approximate) shared between sample and genome
- f_orig_query
- Fraction of the sample in the genome
- f_match
- Fraction of the genome in the sample
- f_unique_to_query
- Fraction of the sample that is only represented in this genome
- f_unique_weighted
- f_unique_to_query weighted by hash abundance
- average_abund
- Average abundance of the hashes found in the sample
- median_abund
- Median abundance of the hashes found in the sample
- std_abund
- Standard deviation of the abundance of the hashes found in the sample
- name
- Genome name
- filename
- Reference data file
- md5
- MD5 checksum of the genome
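A small sketch for ranking the reported genomes (pandas and the literal SAMPLE path are assumptions, not part of the workflow):

```python
# Sketch: genomes explaining most of the sample, ranked by their
# abundance-weighted unique fraction.
# Assumes a tab-separated file with a header row holding the column names above.
import pandas as pd

gather = pd.read_csv(
    "output/Data/ReadClassification/SAMPLE_readclassification_sourmash_gather.tsv",
    sep="\t",
)
cols = ["name", "f_match", "f_unique_weighted", "average_abund"]
print(gather.sort_values("f_unique_weighted", ascending=False)[cols].head(15))
```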
The file SAMPLE_readclassification_sourmash_lca.tsv uses the hashes to provide a taxonomic composition of the sample. While it looks like a Krona data file, the count field gives the number of hashes assigned to a taxon plus the summed counts of its children.
SAMPLE_sourmash_signature.tsv provides information on the generated sketch/signature files.
sourmash_sample_comparison.tsv contains the pairwise comparison of the samples based on their sourmash hashes, using the Jaccard similarity.
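For reference, the Jaccard similarity of two signatures is the number of shared hashes divided by the number of distinct hashes across both signatures; a toy illustration (not workflow code):

```python
# Toy illustration of the Jaccard similarity used for the sample comparison.
hashes_a = {11, 22, 33, 44, 55}
hashes_b = {33, 44, 55, 66}
jaccard = len(hashes_a & hashes_b) / len(hashes_a | hashes_b)
print(jaccard)  # 3 shared hashes / 6 distinct hashes = 0.5
```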