Binning
The final step of the KnuttReads2Bins pipeline, the binning of the contigs into MAGs/bins, is executed using METABAT2 in the Snakefile_5Binning file.
- Small metagenome
- Tbd. GB
- Medium metagenome
- Tbd. GB
- Large metagenome
- Tbd. GB
The binning rule executes METABAT and CheckM. The bins can also be classified with Sourmash (binningSourmash) and BAT (bat). BAT uses the classification of the contigs from CAT. The taxonomic composition of the bins based on the CAT classification can be generated with the batKrona rule. A report is available with binningReport.
If you are interested in the relationship of your bins to published genomes/bins, check out KnuttBinPhylo.
Output files
output
├── output/Benchmarking
│   └── output/Benchmarking/Binning
│       ├── output/Benchmarking/Binning/bat_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/binning_report.tsv
│       ├── output/Benchmarking/Binning/checkm_extra_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/checkm_wf_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/metabat_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/sourmash_bins_description_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/sourmash_bins_k51_SAMPLE.tsv
│       └── output/Benchmarking/Binning/sourmash_classification_SAMPLE.tsv
├── output/Binning
│   ├── output/Binning/BAT
│   │   └── output/Binning/BAT/SAMPLE
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE_BAT.log
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bin2classification.named.txt
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bin2classification.txt
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bins.alignment.diamond
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bins.predicted_proteins.faa
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.log
│   │       └── output/Binning/BAT/SAMPLE/SAMPLE.ORF2LCA.txt
│   ├── output/Binning/CheckM
│   │   └── output/Binning/CheckM/SAMPLE
│   │       ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/bins
│   │       │   │   └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/bins/SAMPLE.X
│   │       │   │       └── ...
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/checkm.log
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/lineage.ms
│   │       │   └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/storage
│   │       │       └── ...
│   │       ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_helperfiles.log
│   │       └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_wf.log
│   ├── output/Binning/METABAT
│   │   └── output/Binning/METABAT/SAMPLE
│   │       ├── output/Binning/METABAT/SAMPLE/bins
│   │       │   ├── output/Binning/METABAT/SAMPLE/bins/SAMPLE.X.fa
│   │       │   └── ...
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE.lowDepth.fa
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE_metabat2.log
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE.tooShort.fa
│   │       └── output/Binning/METABAT/SAMPLE/SAMPLE.unbinned.fa
│   └── output/Binning/sourmash
│       └── output/Binning/sourmash/SAMPLE
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_classification.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_descr.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_k51.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_k51.sig
│           └── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_search.log
├── output/Data
│   └── output/Data/Binning
│       ├── output/Data/Binning/SAMPLE_bat.tsv
│       ├── output/Data/Binning/SAMPLE_binmap.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_classification.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_search.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_signature.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_cov.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_profile.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_tetras.tsv
│       ├── output/Data/Binning/sourmash_bin_comparison
│       ├── output/Data/Binning/sourmash_bin_comparison.dendro.png
│       ├── output/Data/Binning/sourmash_bin_comparison.hist.png
│       ├── output/Data/Binning/sourmash_bin_comparison.labels.txt
│       ├── output/Data/Binning/sourmash_bin_comparison.matrix.png
│       └── output/Data/Binning/sourmash_bin_comparison.tsv
└── output/Reports
    ├── output/Reports/8binning.html
    └── output/Reports/BAT_krona.html
See the general file information for the Benchmarking files.
The file SAMPLE_bat.tsv uses the same format as the CAT files, with bin instead of contig. One important difference is that one bin can have multiple classifications.
If you need it, SAMPLE_binmap.tsv has one column with the bins and reject files and a second column with the contig names.
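A minimal sketch of how such a two-column mapping could be grouped by bin. It assumes the file is tab-separated and that the first column holds the bin and the second the contig name, as described above. Whether the file has a header row is not stated here, so the hypothetical `has_header` switch covers both cases. The rows in the example are made up.

```python
import csv
import io
from collections import defaultdict


def read_binmap(handle, has_header=False):
    """Group contig names by bin from a two-column, tab-separated binmap."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    contigs_by_bin = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        bin_name, contig = row[0], row[1]
        contigs_by_bin[bin_name].append(contig)
    return dict(contigs_by_bin)


# Made-up example rows; real bin and contig names will differ.
example = io.StringIO(
    "SAMPLE.1\tcontig_1\n"
    "SAMPLE.1\tcontig_2\n"
    "SAMPLE.unbinned\tcontig_3\n"
)
bins = read_binmap(example)
print({name: len(contigs) for name, contigs in bins.items()})
```

For a real run, pass an open file handle for output/Data/Binning/SAMPLE_binmap.tsv instead of the in-memory example.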
SAMPLE_bins_sourmash_signature.tsv has the same format as the other signature description files. SAMPLE_bins_sourmash_classification.tsv gives the sourmash classification for every bin:
- bin
- The bin name
- status
- nomatch or found
- Tax columns
- The assigned taxonomy
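The columns above can be summarized with a few lines of Python. This is only a sketch: it assumes the file is tab-separated with a header row that uses the column names bin and status, and it ignores the taxonomy columns because their exact names are not listed here. The example rows are made up.

```python
import csv
import io
from collections import Counter


def summarize_status(handle):
    """Count bins per status and list the bins without a sourmash match."""
    status_counts = Counter()
    unmatched = []
    for row in csv.DictReader(handle, delimiter="\t"):
        status_counts[row["status"]] += 1
        if row["status"] == "nomatch":
            unmatched.append(row["bin"])
    return status_counts, unmatched


# Made-up rows for illustration; the taxonomy columns are left out here.
example = io.StringIO(
    "bin\tstatus\n"
    "SAMPLE.1\tfound\n"
    "SAMPLE.2\tnomatch\n"
    "SAMPLE.3\tfound\n"
)
status_counts, unmatched = summarize_status(example)
print(dict(status_counts), unmatched)
```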
SAMPLE_checkm_cov.tsv is the third coverage file for the contigs that this pipeline provides, based on the same BAM file. It is used by CheckM for its profile calculations. It has just six columns:
- contigid
- The short name of the contig
- bin
- The bin name
- len
- Contig lengths
- bamfile
- BAM file name
- avgcov
- Average coverage of this contig
- reads
- Read count mapped to this contig
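As an illustration of what can be done with these six columns, the following sketch sums contig lengths and read counts per bin and computes a length-weighted mean coverage. It assumes a tab-separated file with a header row that uses the column names listed above. The function name and the example values are hypothetical, and this is not how CheckM itself builds its profile.

```python
import csv
import io
from collections import defaultdict


def coverage_by_bin(handle):
    """Return total length, read count and length-weighted mean coverage per bin."""
    length = defaultdict(int)
    covered_bases = defaultdict(float)
    reads = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        contig_len = int(row["len"])
        length[row["bin"]] += contig_len
        # Weight each contig's average coverage by its length.
        covered_bases[row["bin"]] += contig_len * float(row["avgcov"])
        reads[row["bin"]] += int(row["reads"])
    return {
        name: {
            "len": length[name],
            "reads": reads[name],
            "avgcov": covered_bases[name] / length[name],
        }
        for name in length
    }


# Made-up values for illustration only.
example = io.StringIO(
    "contigid\tbin\tlen\tbamfile\tavgcov\treads\n"
    "c1\tSAMPLE.1\t1000\tSAMPLE.bam\t10\t100\n"
    "c2\tSAMPLE.1\t3000\tSAMPLE.bam\t20\t500\n"
    "c3\tSAMPLE.2\t2000\tSAMPLE.bam\t5\t80\n"
)
summary = coverage_by_bin(example)
for name, values in summary.items():
    print(name, values)
```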
This profile is written to the SAMPLE_checkm_profile.tsv file. The bin unbinned also contains the contigs which were too short or did not have enough coverage. The file has these columns:
- bin
- The bin name
- binsize_Mbp
- Size of the bin in megabase pairs
- reads
- Reads mapped to this bin
- reads_perc
- Percentage of reads mapped to this bin
- ofbinnedpop_perc
- How much of the successfully binned population does this bin represent
- ofcommunity_perc
- Estimated percentage of the whole community that this bin represents
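To make the idea of a size-normalized share more concrete, here is a rough sketch that divides the reads of each bin by its size and rescales the result to 100 % over all real bins. It assumes a tab-separated file with a header row and the column names listed above, and it skips the bin named unbinned. The values CheckM writes to the percentage columns may be computed differently, so treat this as an illustration of the principle, not a re-implementation. The example values are made up.

```python
import csv
import io


def size_normalized_share(handle, skip=("unbinned",)):
    """Percentage of each bin after normalizing mapped reads by bin size."""
    density = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["bin"] in skip:
            continue  # leave out the pseudo-bin holding rejected contigs
        size = float(row["binsize_Mbp"])
        if size > 0:
            density[row["bin"]] = int(row["reads"]) / size  # reads per Mbp
    total = sum(density.values())
    if not total:
        return {}
    return {name: 100.0 * value / total for name, value in density.items()}


# Made-up values for illustration only.
example = io.StringIO(
    "bin\tbinsize_Mbp\treads\n"
    "SAMPLE.1\t2.0\t1000\n"
    "SAMPLE.2\t4.0\t1000\n"
    "unbinned\t10.0\t5000\n"
)
shares = size_normalized_share(example)
print(shares)
```

In this toy input both bins attract the same number of reads, but the smaller bin ends up with the larger share because its reads are spread over fewer bases.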
Mainly for educational purposes, a file giving the tetranucleotide frequencies for every contig (SAMPLE_checkm_tetras.tsv) is available.
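For readers new to the concept, a tetranucleotide frequency profile can be sketched as follows. The function counts all overlapping 4-mers of a sequence and converts the counts to relative frequencies. It is a simplified illustration: tools such as CheckM may treat reverse complements or ambiguous bases differently, so the numbers will not necessarily match the file. The function name is hypothetical.

```python
from collections import Counter


def tetra_frequencies(sequence):
    """Relative frequency of every overlapping 4-mer made of A, C, G and T."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - 3):
        kmer = sequence[start:start + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# Toy sequence; real contigs are much longer.
freqs = tetra_frequencies("ACGTACGT")
for kmer, freq in sorted(freqs.items()):
    print(kmer, round(freq, 2))
```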
The well-known CheckM lineage workflow data is placed in SAMPLE_checkm.tsv:
- bin
- The bin name
- marker_lineage
- The marker set selected by phylogenetic placement
- lineage_genomes
- Number of reference genomes placed on that phylogenetic node
- lineage_markers
- Number of marker genes in this marker set
- lineage_marker_sets
- Number of co-located marker sets used for this node
- X_sets
- Number of markers that were identified X times
- completeness
- Completion based on the markers
- contaimination
- Contamination based on the markers
- strain_heterogeneity
- Percentage of the contamination caused by very similar marker genes
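A typical use of this table is to select bins by completeness and contamination. The sketch below assumes a tab-separated file with a header row. The cutoffs of 90 % completeness and 5 % contamination are commonly used values chosen here as an assumption, not something the pipeline prescribes. The contamination column name is passed in explicitly so that it can be matched to the header of the actual file. The example rows are made up.

```python
import csv
import io


def select_bins(handle, contamination_col,
                min_completeness=90.0, max_contamination=5.0):
    """Return the names of bins that pass both quality cutoffs."""
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["completeness"])
        contamination = float(row[contamination_col])
        if completeness >= min_completeness and contamination <= max_contamination:
            selected.append(row["bin"])
    return selected


# Made-up rows; only the columns needed for the filter are included.
example = io.StringIO(
    "bin\tcompleteness\tcontaimination\n"
    "SAMPLE.1\t97.5\t1.2\n"
    "SAMPLE.2\t92.0\t8.4\n"
    "SAMPLE.3\t55.3\t0.5\n"
)
# The column name is spelled as in the list above; check it against the file header.
good_bins = select_bins(example, contamination_col="contaimination")
print(good_bins)
```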