Read Classification
KnuttReads2Bins currently supports three toolchains for classifying the prepared reads. The first is a small chain that filters reads of the SSU rRNA gene and classifies them using SINA. The second uses Kaiju to search protein databases, such as NCBI's non-redundant protein set, for k-mers found in the reads. The third uses Sourmash, which relies on hashes of nucleotide k-mers.
When looking at Krona reports, make sure to check the fraction of the sample that actually was classified.
The steps for this stage are defined in the `Snakefile_2ClassifyReads` file. The disk space needed for the reference database depends on the selected Kaiju database; with a current nr_euk dataset, around Tbd. GB are required.
- Small metagenome: Tbd. GB
- Medium metagenome: Tbd. GB
- Large metagenome: Tbd. GB
Reference data can be generated with the `classifyReadsRefData` rule. All classification steps can be run with the `classifyReads`, `classifyReadsKrona` and `classifyReadsReport` rules. Outputs are stored in `<outputdir>/ReadClassification/`, `<outputdir>/Data/ReadClassification/` and `<outputdir>/Reports/`.
SSU reads (SINA)
The SSU workflow uses the configured version of the SILVA SSU NR 99 database to build a BBMap and SINA index for analysis. This can be executed with the `SSURefData` rule. The files use around 5.5 GB of disk space.

The filtering and classification of the reads can be run with the `SSU` rule. A report can be generated with the `SSUReport` rule.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       ├── sina_SAMPLE.tsv
│       └── ssu_bbmap_map_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       └── SAMPLE_readclassification_SSU.tsv
├── ReadClassification
│   └── SSU
│       └── SAMPLE
│           ├── SAMPLE_bbmap_SSU.bam
│           ├── SAMPLE_bbmap_SSU.fasta
│           ├── SAMPLE_bbmap_SSU.out
│           ├── SAMPLE_sina_hits.csv
│           ├── SAMPLE_sina_hits.fasta
│           └── SAMPLE_sina_hits.log
└── Reports
    ├── 5.1SSUreport.html
    └── SSU_krona.html
See the general file information for the `Benchmarking` files. The `SAMPLE_readclassification_SSU.tsv` file combines the results from SINA (`SAMPLE_sina_hits.csv`) and BBMap (`SAMPLE_bbmap_SSU.bam`). It contains the following columns; every row is a read retained by the BBMap filter.
- `qname`: Full ID of the read
- `sname`: Full ID of the BBMap hit
- `strand`: Strand of the BBMap hit (-/+)
- `pos`: Start (left-most) position of the alignment
- `qwidth`: Read length
- `mapq`: Mapping quality, the error rate as a PHRED score
- `cigar`: CIGAR alignment string
- `tag.MD`: BBMap MD tag
- `tag.XM`: Number of mismatches (BBMap)
- `tag.NM`: Edit distance (BBMap)
- `align_cutoff_head_slv`: Unaligned bases at the beginning of the read (SINA)
- `align_cutoff_tail_slv`: Unaligned bases at the end of the read (SINA)
- `align_filter_slv`: Positional variability by parsimony filter applied (default: NA)
- `align_quality_slv`: Highest similarity to a reference
- `aligned_slv`: Date and time of the alignment
- `lca_tax_embl`: LCA result of the EMBL-EBI/ENA taxa
- `lca_tax_gg`: LCA result of the Greengenes taxa
- `lca_tax_gtdb`: LCA result of the GTDB taxa (recommended)
- `lca_tax_rdp`: LCA result of the RDP taxa
- `lca_tax_slv`: LCA result of the SILVA taxa
- `nearest_slv`: References found in the homology search, space separated as ID.VERSION.START.END~IDENTITY
- `turn`: Did SINA reverse and/or complement the read?
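The space-separated `nearest_slv` entries can be split back into their components with a few lines of Python. This sketch is not part of the workflow; the example value is invented for illustration and only follows the ID.VERSION.START.END~IDENTITY pattern described above.

```python
def parse_nearest_slv(field):
    """Split a nearest_slv field into (id, version, start, end, identity) tuples."""
    hits = []
    for entry in field.split():
        ref, identity = entry.rsplit("~", 1)          # separate the identity value
        acc, version, start, end = ref.rsplit(".", 3)  # ID.VERSION.START.END
        hits.append((acc, int(version), int(start), int(end), float(identity)))
    return hits

# Hypothetical field value for illustration only:
example = "AF123456.1.1.1500~0.987 X98765.2.5.1402~0.951"
for acc, version, start, end, ident in parse_nearest_slv(example):
    print(acc, version, start, end, ident)
```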
Protein reads (Kaiju)
Kaiju can use different databases. Their usage and construction can be quite memory-hungry and time-intensive. You may want to change the thread count in the Snakefile. It is also possible to download the `.fmi` file from the Kaiju website and copy it to the appropriate location (you can see it when running the workflow with the `-rn` flags). Reference data generation is executed with the `kaijuRefData` rule.

The classification for all samples is run with the `kaiju` rule. A report can be generated with the `kaijuReport` rule. The Krona report (`kaijuKrona`) doesn't contain the database.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       └── kaiju_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       └── SAMPLE_readclassification_kaiju.tsv
├── ReadClassification
│   └── kaiju
│       └── SAMPLE
│           └── SAMPLE_kaiju.log
└── Reports
    ├── 5.2kaijureport.html
    └── kaiju_krona.html
See the general file information for the `Benchmarking` files. The `SAMPLE_readclassification_kaiju.tsv` file contains the results, with one row for every read.
- `qname`: Full ID of the read
- `classified`: Was this read classified (U or C)
- `taxid`: Assigned NCBI taxonomy ID
- `lengthorscore`: Length (MEM) or score (Greedy) of this match
- `match_taxids`: Taxonomy IDs of the sequences (max. 20) with the best match
- `match_ascs`: Accessions of those sequences
- `match_fragments`: Matching fragments
- `tax`: Classification taxonomy string, separated by ;
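Since the file contains a row for every read, the classified fraction mentioned for the Krona reports can be computed directly from the `classified` column. A minimal sketch, not part of the workflow; the TSV excerpt below is invented and only mirrors the columns listed above.

```python
import csv
import io

# Hypothetical excerpt of a SAMPLE_readclassification_kaiju.tsv file;
# values are invented for illustration.
tsv = (
    "qname\tclassified\ttaxid\ttax\n"
    "read1\tC\t562\tBacteria;Proteobacteria;Escherichia coli\n"
    "read2\tU\t0\t\n"
    "read3\tC\t2\tBacteria\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
classified = sum(1 for r in rows if r["classified"] == "C")
fraction = classified / len(rows)
print(f"{classified}/{len(rows)} reads classified")
```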
Polynucleotide hash sketches (Sourmash)
Sourmash is fast and memory-efficient. The default settings use a high k-mer length to provide more specific results than the Kaiju analysis. The taxonomy and the available genomes are based on the Genome Taxonomy Database (GTDB) project. The hash count is proportional to the read count. While this preserves the diversity of the reads, it makes samples of uneven depths even harder to compare.
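The proportionality of hash count to read count comes from the "scaled" MinHash approach: only k-mer hashes below a fixed threshold are kept, so the retained set grows with the number of distinct k-mers. The sketch below illustrates the idea only; it uses SHA-1 instead of Sourmash's actual hash function, and the `scaled` value is an invented placeholder.

```python
import hashlib

def scaled_hashes(kmers, scaled=4):
    """Keep only hashes below max_hash; on average 1/scaled of distinct k-mers survive."""
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers:
        # Illustrative hash only; Sourmash uses a different hash function.
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            kept.add(h)
    return kept

kmers = ["AAA", "AAT", "ATT", "TTT", "CCC", "CCG", "CGG", "GGG"]
print(len(scaled_hashes(kmers)), "of", len(kmers), "k-mer hashes retained")
```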
Download the reference database with the `sourmashRefData` rule, run all analysis steps with `sourmash` and produce the report with `sourmashReport`. The Sourmash Krona report (`sourmashKrona`) also doesn't contain the database.
Output files
output
├── Benchmarking
│   └── ReadClassification
│       ├── sourmash_comparison.tsv
│       ├── sourmash_description_SAMPLE.tsv
│       ├── sourmash_gather_SAMPLE.tsv
│       ├── sourmash_k51_SAMPLE.tsv
│       └── sourmash_lca_SAMPLE.tsv
├── Data
│   └── ReadClassification
│       ├── SAMPLE_readclassification_sourmash_gather.tsv
│       ├── SAMPLE_readclassification_sourmash_lca.tsv
│       ├── SAMPLE_sourmash_signature.tsv
│       ├── sourmash_sample_comparison
│       ├── sourmash_sample_comparison.bin
│       ├── sourmash_sample_comparison.bin.labels.txt
│       ├── sourmash_sample_comparison.dendro.png
│       ├── sourmash_sample_comparison.hist.png
│       ├── sourmash_sample_comparison.labels.txt
│       ├── sourmash_sample_comparison.matrix.png
│       └── sourmash_sample_comparison.tsv
├── ReadClassification
│   └── sourmash
│       ├── SAMPLE
│       │   ├── SAMPLE_sourmash_descr.log
│       │   ├── SAMPLE_sourmash_gather.log
│       │   ├── SAMPLE_sourmash_k51.log
│       │   ├── SAMPLE_sourmash_k51.sig
│       │   └── SAMPLE_sourmash_lca.log
│       └── sourmash_sample_comparison.log
└── Reports
    ├── 5.3sourmashReport.html
    └── sourmash_krona.html
See the general file information for the `Benchmarking` files. `SAMPLE_readclassification_sourmash_gather.tsv` lists the genomes from the database that were found in the sample.
- `intersect_bp`: Base pairs (approximate) shared between sample and genome
- `f_orig_query`: Fraction of the sample in the genome
- `f_match`: Fraction of the genome in the sample
- `f_unique_to_query`: Fraction of the sample that is only represented in this genome
- `f_unique_weighted`: `f_unique_to_query` weighted by hash abundance
- `average_abund`: Average abundance of the hashes found in the sample
- `median_abund`: Median abundance of the hashes found in the sample
- `std_abund`: Standard deviation of the abundance of the hashes found in the sample
- `name`: Genome name
- `filename`: Reference data file
- `md5`: MD5 checksum of the genome
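Because each `f_unique_to_query` value covers a disjoint part of the sample, summing the column gives a rough estimate of how much of the sample is represented by database genomes at all. A minimal sketch, not part of the workflow; the excerpt below is invented and only uses the columns described above.

```python
import csv
import io

# Hypothetical excerpt of SAMPLE_readclassification_sourmash_gather.tsv
# (only the columns used here; values invented for illustration).
tsv = (
    "name\tf_unique_to_query\tf_match\n"
    "Escherichia coli\t0.30\t0.85\n"
    "Bacteroides fragilis\t0.12\t0.40\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
# The unique fractions are disjoint, so their sum approximates the
# total fraction of the sample covered by database genomes.
covered = sum(float(r["f_unique_to_query"]) for r in rows)
print(f"Fraction of sample hashes assigned: {covered:.2f}")
```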
The file `SAMPLE_readclassification_sourmash_lca.tsv` uses the hashes to provide a taxonomic composition of the sample. While it looks like a Krona data file, the `count` field gives the number of hashes assigned to a node plus the summed counts of its children.
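The cumulative behaviour of such a `count` field can be sketched as follows. The toy taxonomy and hash counts are invented for illustration and are not taken from any real output file.

```python
# Each node reports its own assigned hashes plus the counts of all children.
tree = {
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": [],
    "Firmicutes": [],
}
assigned = {"Bacteria": 5, "Proteobacteria": 40, "Firmicutes": 15}

def cumulative_count(node):
    """count field: own assigned hashes plus summed child counts."""
    return assigned[node] + sum(cumulative_count(c) for c in tree[node])

print(cumulative_count("Bacteria"))  # 5 + 40 + 15 = 60
```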
`SAMPLE_sourmash_signature.tsv` provides information on the generated sketch/signature files.
`sourmash_comparison.tsv` contains the comparison of the samples based on the Sourmash hashes, using the Jaccard similarity.
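The Jaccard similarity underlying the comparison is simply the size of the intersection of two hash sets divided by the size of their union. A minimal sketch with invented hash sets:

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Invented hash sets standing in for two samples' Sourmash hashes:
sample_a = {1, 2, 3, 4}
sample_b = {3, 4, 5, 6}
print(jaccard(sample_a, sample_b))  # 2 shared of 6 total hashes
```

Identical samples score 1.0, samples sharing no hashes score 0.0, which is why samples of very uneven sequencing depth are hard to compare this way.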