Binning

Binning of the contigs into MAGs/bins is the final step of the KnuttReads2Bins pipeline. It is executed with METABAT2 in the Snakefile_5Binning file.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

The binning rule executes METABAT and CheckM. The bins can also be classified with Sourmash (binningSourmash) and BAT (bat). BAT uses the classification of the contigs from CAT. The taxonomic composition of the bins, based on the CAT classification, can be generated with the batKrona rule. A report is available with binningReport.

If you are interested in the relationship of your bins to published genomes/bins, check out KnuttBinPhylo.


Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/Binning
│       ├── output/Benchmarking/Binning/bat_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/binning_report.tsv
│       ├── output/Benchmarking/Binning/checkm_extra_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/checkm_wf_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/metabat_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/sourmash_bins_description_SAMPLE.tsv
│       ├── output/Benchmarking/Binning/sourmash_bins_k51_SAMPLE.tsv
│       └── output/Benchmarking/Binning/sourmash_classification_SAMPLE.tsv
├── output/Binning
│   ├── output/Binning/BAT
│   │   └── output/Binning/BAT/SAMPLE
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE_BAT.log
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bin2classification.named.txt
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bin2classification.txt
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bins.alignment.diamond
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.bins.predicted_proteins.faa
│   │       ├── output/Binning/BAT/SAMPLE/SAMPLE.log
│   │       └── output/Binning/BAT/SAMPLE/SAMPLE.ORF2LCA.txt
│   ├── output/Binning/CheckM
│   │   └── output/Binning/CheckM/SAMPLE
│   │       ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/bins
│   │       │   │   └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/bins/SAMPLE.X
│   │       │   │       └── ...
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/checkm.log
│   │       │   ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/lineage.ms
│   │       │   └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_data/storage
│   │       │       └── ...
│   │       ├── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_helperfiles.log
│   │       └── output/Binning/CheckM/SAMPLE/SAMPLE_checkm_wf.log
│   ├── output/Binning/METABAT
│   │   └── output/Binning/METABAT/SAMPLE
│   │       ├── output/Binning/METABAT/SAMPLE/bins
│   │       │   ├── output/Binning/METABAT/SAMPLE/bins/SAMPLE.X.fa
│   │       │   └── ...
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE.lowDepth.fa
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE_metabat2.log
│   │       ├── output/Binning/METABAT/SAMPLE/SAMPLE.tooShort.fa
│   │       └── output/Binning/METABAT/SAMPLE/SAMPLE.unbinned.fa
│   └── output/Binning/sourmash
│       └── output/Binning/sourmash/SAMPLE
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_classification.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_descr.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_k51.log
│           ├── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_k51.sig
│           └── output/Binning/sourmash/SAMPLE/SAMPLE_bins_sourmash_search.log
├── output/Data
│   └── output/Data/Binning
│       ├── output/Data/Binning/SAMPLE_bat.tsv
│       ├── output/Data/Binning/SAMPLE_binmap.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_classification.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_search.tsv
│       ├── output/Data/Binning/SAMPLE_bins_sourmash_signature.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_cov.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_profile.tsv
│       ├── output/Data/Binning/SAMPLE_checkm_tetras.tsv
│       ├── output/Data/Binning/sourmash_bin_comparison
│       ├── output/Data/Binning/sourmash_bin_comparison.dendro.png
│       ├── output/Data/Binning/sourmash_bin_comparison.hist.png
│       ├── output/Data/Binning/sourmash_bin_comparison.labels.txt
│       ├── output/Data/Binning/sourmash_bin_comparison.matrix.png
│       └── output/Data/Binning/sourmash_bin_comparison.tsv
└── output/Reports
    ├── output/Reports/8binning.html
    └── output/Reports/BAT_krona.html

See the general file information for the Benchmarking files.

The file SAMPLE_bat.tsv uses the same format (bin instead of contig) as the CAT files. One important difference is that one bin can have multiple classifications.

SAMPLE_binmap.tsv maps contigs to bins. One column holds the bin name (or the reject file the contig ended up in) and a second column holds the contig name.
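The bin map can be turned into a lookup from bin to contigs with a few lines of Python. This is a minimal sketch, not part of the pipeline: it assumes the file is tab-separated with the bin in the first column, the contig in the second column and a single header line. The function name `read_binmap`, the header names and the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_binmap(handle, has_header=True):
    """Group contig names by bin from a two-column, tab-separated bin map."""
    rows = csv.reader(handle, delimiter="\t")
    if has_header:
        next(rows, None)  # skip the header line (assumed to exist)
    contigs_by_bin = defaultdict(list)
    for row in rows:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        bin_name, contig = row[0], row[1]
        contigs_by_bin[bin_name].append(contig)
    return dict(contigs_by_bin)


# Made-up example content; the real file is output/Data/Binning/SAMPLE_binmap.tsv
example = io.StringIO(
    "bin\tcontig\n"
    "SAMPLE.1\tcontig_1\n"
    "SAMPLE.1\tcontig_7\n"
    "SAMPLE.unbinned\tcontig_3\n"
)
binmap = read_binmap(example)
print({name: len(contigs) for name, contigs in binmap.items()})
```

For a real run, replace the `io.StringIO` object with `open("output/Data/Binning/SAMPLE_binmap.tsv", newline="")` and check the header assumption against your file first.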

The SAMPLE_bins_sourmash_signature.tsv has the same format as the other signature description files. SAMPLE_bins_sourmash_classification.tsv gives the sourmash classification for every bin:

bin
The bin name
status
nomatch or found
Tax columns
The assigned taxonomy
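A quick way to see how many bins received a sourmash classification is to count the values in the status column. The sketch below assumes a tab-separated file with a header line containing the columns bin and status as listed above. The taxonomy columns are left out of the made-up example rows because their exact names are not given here, and `status_summary` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def status_summary(handle):
    """Count classification statuses and list the bins without a match."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    unmatched = [row["bin"] for row in rows if row["status"] == "nomatch"]
    return counts, unmatched


# Made-up example rows; real files also contain the taxonomy columns
example = io.StringIO(
    "bin\tstatus\n"
    "SAMPLE.1\tfound\n"
    "SAMPLE.2\tnomatch\n"
    "SAMPLE.3\tfound\n"
)
counts, unmatched = status_summary(example)
print(dict(counts), unmatched)
```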

SAMPLE_checkm_cov.tsv is the third coverage file for the contigs that this pipeline provides based on the same BAM file. CheckM uses it for its profile calculations. It has six columns:

contigid
The short name of the contig
bin
The bin name
len
Contig length
bamfile
BAM file name
avgcov
Average coverage of this contig
reads
Read count mapped to this contig
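The per-contig values in this file can be summarised per bin. The following sketch computes a length-weighted mean coverage and the total read count for every bin. It assumes a tab-separated file with a header line using the six column names listed above and integer values in len and reads. It is an illustration only and not how CheckM computes its profile. The function name `bin_coverage` and the example rows are made up.

```python
import csv
import io
from collections import defaultdict


def bin_coverage(handle):
    """Return (length-weighted mean coverage, total reads) for every bin."""
    total_len = defaultdict(int)
    weighted_cov = defaultdict(float)
    total_reads = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        length = int(row["len"])
        total_len[row["bin"]] += length
        # Weight each contig's coverage by its length
        weighted_cov[row["bin"]] += float(row["avgcov"]) * length
        total_reads[row["bin"]] += int(row["reads"])
    return {
        name: (weighted_cov[name] / total_len[name], total_reads[name])
        for name in total_len
    }


# Made-up example rows using the documented column names
example = io.StringIO(
    "contigid\tbin\tlen\tbamfile\tavgcov\treads\n"
    "c1\tSAMPLE.1\t1000\tSAMPLE.bam\t10.0\t100\n"
    "c2\tSAMPLE.1\t3000\tSAMPLE.bam\t20.0\t600\n"
    "c3\tSAMPLE.2\t2000\tSAMPLE.bam\t5.0\t100\n"
)
coverage = bin_coverage(example)
print(coverage)
```

Weighting by contig length keeps many short contigs from dominating the bin average, which a plain mean over the avgcov column would allow.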

The CheckM profile is written to the SAMPLE_checkm_profile.tsv file. The bin unbinned also contains the contigs that were too short or did not have enough coverage.

bin
The bin name
binsize_Mbp
Size of the bin in megabase pairs (Mbp)
reads
Reads mapped to this bin
reads_perc
Percentage of reads mapped to this bin
ofbinnedpop_perc
How much of the successfully binned population does this bin represent
ofcommunity_perc
How much of the whole community does this bin represent

Mainly for educational purposes, a file with the tetranucleotide frequencies of every contig (SAMPLE_checkm_tetras.tsv) is also available.

The results of the well-known CheckM lineage workflow are placed in SAMPLE_checkm.tsv:
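To illustrate what a tetranucleotide frequency is, the sketch below counts every overlapping 4-mer in a sequence and divides by the total. It is a simplified illustration and not CheckM's implementation: it does not merge reverse complements, and it skips windows containing characters other than A, C, G and T. The function name and the example sequence are made up.

```python
from collections import Counter


def tetranucleotide_frequencies(sequence):
    """Relative frequency of every overlapping 4-mer made only of A, C, G, T."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        sequence[i:i + 4]
        for i in range(len(sequence) - 3)
        if set(sequence[i:i + 4]) <= valid  # skip windows with N or other symbols
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence too short or without any valid window
    return {kmer: n / total for kmer, n in counts.items()}


freqs = tetranucleotide_frequencies("ACGTACGTNACGT")
print(freqs)
```

Contigs from the same genome tend to share a similar tetranucleotide profile, which is why such frequencies are a common signal in binning.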

bin
The bin name
marker_lineage
The marker set selected by phylogenetic placement
lineage_genomes
Reference genomes which were placed on that phylogenetic node
lineage_markers
Markers in this set of marker sets
lineage_marker_sets
Colocated marker sets used for this node
X_sets
Number of markers that were identified X times (one column per value of X)
completeness
Completion based on the markers
contaimination
Contamination based on the markers
strain_heterogeneity
Percentage of the contamination caused by very similar marker genes
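The completeness and contamination columns are typically used to pick the bins worth keeping. The sketch below filters bins by two cutoffs. It assumes a tab-separated file with a header line using the column names listed above, including the contamination column spelled exactly as shown there. The cutoffs of 50 and 10 are illustrative defaults only and should be chosen to fit your study. The function name `select_bins` and the example rows are made up.

```python
import csv
import io

# Column name spelled as in the column list above; adjust if your file differs.
CONTAMINATION_COLUMN = "contaimination"


def select_bins(handle, min_completeness=50.0, max_contamination=10.0):
    """Return the names of bins that pass both illustrative cutoffs."""
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["completeness"])
        contamination = float(row[CONTAMINATION_COLUMN])
        if completeness >= min_completeness and contamination <= max_contamination:
            selected.append(row["bin"])
    return selected


# Made-up example rows; real files contain the other columns as well
example = io.StringIO(
    "bin\tcompleteness\tcontaimination\n"
    "SAMPLE.1\t95.2\t1.3\n"
    "SAMPLE.2\t62.0\t14.8\n"
    "SAMPLE.3\t48.5\t0.0\n"
)
good_bins = select_bins(example)
print(good_bins)
```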