# KnuttBinAnno
KnuttBinAnno combines different annotation pipelines into a single one while automating the installation process.

The pipelines are: MetaErg, dbCAN2, EggNOGMapper2, KofamKOALA, InterProScan and HydDB. The HydDB annotation uses the online web service and therefore requires an internet connection. InterProScan tries to use the online lookup service to speed up the annotation, but it can operate without it.
## Quickstart
The `all` rule combines the tools given in the config file into a single TSV and GFF file for the predicted CDS. MetaErg output is not included.
- Place your bin files (DNA) in the `input/` folder.
- Check the `config.yml`.
- Use the following schema for your bin files: `BIN.fa`
- Run the whole workflow with:

  ```sh
  snakemake -kj 32 --use-conda all allReport 2>&1 | tee run.log & disown
  ```
## MetaErg
MetaErg is the first pipeline and is always run, as it needs to run on the contigs directly and executes Prodigal. The vCPU limitation does not work correctly, as MetaErg starts multiple processes with the given core count at the same time. The rule for execution is `metaerg`; `metaergRefData` performs the installation.
### Output files
```
output
├── output/Benchmarking
│   └── output/Benchmarking/metaerg_BIN.tsv
└── output/MetaErg
    └── output/MetaErg/BIN
        ├── output/MetaErg/BIN/BIN_ddmmyyyy.fna
        ├── output/MetaErg/BIN/data
        │   ├── output/MetaErg/BIN/data/XXSrRNA.ffn
        │   ├── output/MetaErg/BIN/data/all.gff
        │   ├── output/MetaErg/BIN/data/cds.faa
        │   ├── output/MetaErg/BIN/data/cds.ffn
        │   ├── output/MetaErg/BIN/data/crispr.ffn
        │   ├── output/MetaErg/BIN/data/crispr.tab.txt
        │   ├── output/MetaErg/BIN/data/master.gff.txt
        │   ├── output/MetaErg/BIN/data/master.stats.txt
        │   ├── output/MetaErg/BIN/data/master.tbl.txt
        │   ├── output/MetaErg/BIN/data/master.tsv.txt
        │   ├── output/MetaErg/BIN/data/rRNA.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.datatable.json
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.lsu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.ssu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/tRNA.ffn
        │   ├── output/MetaErg/BIN/data/tRNA.tab.txt
        │   └── ...
        ├── output/MetaErg/BIN/index.html
        └── output/MetaErg/BIN/metaerg_run.log
```
See the general file information for the `Benchmarking` files. Not all files are shown. The `index.html` is not standalone; for example, it uses the `.json` files in the data directory.
To make the workflow scale better on HPC clusters, the CDS FASTA files are split into multiple chunks. This can be configured in the config file. The `split` rule runs the splitting.
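As a sketch of the chunking idea only (not the workflow's actual implementation; the chunk count and CDS names below are assumptions), sequence identifiers can be assigned round-robin to a fixed number of chunks:

```r
chunk_count <- 4  # assumption: stands in for the chunk setting in the config file
cds_ids <- paste0("cds_", 1:10)
# Round-robin assignment: CDS i goes to chunk (i - 1) %% chunk_count + 1
chunks <- split(cds_ids, rep_len(seq_len(chunk_count), length(cds_ids)))
lengths(chunks)  # chunk sizes: 3 3 2 2
```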
Some annotation tools and reports use the KEGG database. Make sure your use case is within the license agreement.
## dbCAN2
At the time of implementation, the official dbCAN2 workflow script was somewhat unstable and not distributable, so it has been reimplemented in this pipeline. You may want to compare results.
The reference data is created with `dbCANRefData`, the annotation is run with the `dbCAN` rule, and a report is available with `dbCANReport`. The results are in the `output/dbCAN/BIN/BIN_dbCAN_all.tsv` file. It is the base for a GFF3 file; you can use the following R code to extract data from those files:
```r
source("scripts/commonGFF3Tools.R")
library(data.table)
library(tibble)  # for as_tibble
library(tidyr)

bins <- list.files("output/dbCAN/")
dbcan <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_all.tsv")
dbcan <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], readBaseGFF(fread(dbcan[[i]])))))
# One row per CAZy family hit
cazydat <- as.data.table(unnest(as_tibble(dbcan[, .(bin, CAZY)]), "CAZY"))
# Extract the family class (e.g. "GH" from "GH13")
cazydat[, class := sub("^(\\D+)\\d.*", "\\1", CAZY)]
bincount <- cazydat[, .N, by = c("bin", "class", "CAZY")]
```
A custom version of a CGC finder places proposed clusters in `output/dbCAN/BIN/BIN_dbCAN_cgc.tsv`:
```r
cgc <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_cgc.tsv")
cgc <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], fread(cgc[[i]]))))
# members and sources hold "|;|"-separated lists; expand them to one row per element
listcols <- c("members", "sources")
cgc[, (listcols) := lapply(.SD, strsplit, split = "|;|", fixed = TRUE), .SDcols = listcols]
cgc <- as.data.table(unnest(cgc, all_of(listcols)))
```
## eggNOG Mapper
The annotation tool of the eggNOG database can be run with the rules `eggNOGRefData`, `eggNOG` and `eggNOGReport`.

The file `output/eggNOG/BIN/BIN_eggNOG.tsv` also uses the GFF base format:
```r
eggnog <- paste0("output/eggNOG/", bins, "/", bins, "_eggNOG.tsv")
eggnog <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], readBaseGFF(fread(eggnog[[i]])))))
```
## InterProScan
The following rules allow the execution of InterProScan: `interProRefData`, `interPro` and `interProReport`.
If you get the error `libdw.so.1: cannot open shared object file: No such file or directory`, install `elfutils` on your system. If this is not possible, you can also turn off the RPSBLAST search by changing the InterProScan command line call.
The file `output/InterProScan/BIN/BIN_iprscan.tsv` contains the following columns:
- `seqid`: Short ID of the CDS
- `md5`: MD5 of the CDS amino acid sequence
- `len`: Length of that sequence
- `analysis`: Annotation tool/database
- `sigacc`: Accession in the database
- `sigdescr`: Human-readable name of the signature hit
- `start`: CDS start
- `stop`: CDS end
- `score`: Score or e-value, depending on the tool
- `status`: Always `T`
- `date`: Date of the annotation
- `interpro`: IPR accession
- `interprodescr`: Human-readable name of the InterPro entry
- `GO`: GO terms (`|`-separated)
- `pathways`: Pathways and terms within (`|`-separated)
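For example, the `|`-separated `GO` column can be expanded to one row per term with `data.table`. The rows below are invented for illustration; only the column names come from the list above:

```r
library(data.table)

# Toy rows mimicking a few of the described columns (values are made up)
ipr <- data.table(
  seqid    = c("cds_1", "cds_1", "cds_2"),
  analysis = c("Pfam", "TIGRFAM", "Pfam"),
  interpro = c("IPR000001", "IPR000002", "IPR000003"),
  GO       = c("GO:0008150|GO:0003674", "GO:0005575", "")
)
# One row per GO term; rows with an empty GO field drop out
golong <- ipr[GO != "", .(GO = unlist(strsplit(GO, "|", fixed = TRUE))),
              by = .(seqid, interpro)]
```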
## KofamKOALA
KofamKOALA is available through the rules `kofamRefData` and `kofam`.
The filtering of hits in KofamKOALA is based on score thresholds derived from the score distribution of known family members. Some terms are too rare for this calculation. The files marked `sure` only contain entries where a threshold was available and exceeded; the `all` files use the e-value given in the config file to filter hits without a score threshold.
The results are contained in a GFF base file (`output/KofamKOALA/BIN/BIN_KofamKOALA_gffbase.tsv`) and a normal TSV file (`output/KofamKOALA/BIN/BIN_KofamKOALA.tsv`).
- `hit`: Did this match meet the threshold
- `seqid`: Short ID of the CDS
- `KO`: K term hit
- `thrshld`: Score threshold for this term
- `score`: Score of this match
- `eval`: E-value of this match
- `descr`: Human-readable name of the term
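The `sure`/`all` distinction described above can be sketched with these columns. The rows are toy values, and the e-value cutoff is an assumption standing in for the config setting:

```r
library(data.table)

kofam <- data.table(
  seqid   = c("cds_1", "cds_2", "cds_3"),
  KO      = c("K00001", "K00002", "K00003"),
  thrshld = c(100.3, NA, 50.0),   # NA: no threshold available for this term
  score   = c(120.5, 80.1, 30.2),
  eval    = c(1e-30, 1e-12, 1e-3)
)
evalue_cutoff <- 1e-10  # assumption: the e-value configured in config.yml

# "sure": a threshold was available and exceeded
sure <- kofam[!is.na(thrshld) & score >= thrshld]
# "all": additionally keep threshold-less hits with a good enough e-value
allhits <- kofam[(!is.na(thrshld) & score >= thrshld) |
                 (is.na(thrshld) & eval <= evalue_cutoff)]
```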
`output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all.tsv` and `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure.tsv` can be uploaded to the KEGG BRITE mapping tool. This module search is also performed in the workflow; results should be similar.

Module search results are stored in the files `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure_modules.tsv` and `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all_modules.tsv`.
- `module`: Accession of the module
- `name`: Human-readable name
- `Module.Type`: Type (Pathway/Signature)
- `Upper.Class`: For example, carbohydrate or amino acid metabolism
- `Lower.Class`: Subtypes
- `allterms`: All required terms in the module
- `hitterms`: Terms found in the bin
- `alloptionals`: All optional terms
- `hitoptionals`: Optional terms found in the bin
- `optionalblockcount`: Number of optional blocks
- `optionalblockhits`: Completed optional blocks
- `blockcount`: Number of blocks in the module
- `blockhits`: Completed blocks
- `completion`: Number of completed blocks divided by the total block count
- `hits`: CDS contributing to this module
- `optionals`: Optional CDS contributing to this module
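The `completion` value can be reproduced from `blockhits` and `blockcount`. A minimal sketch with invented rows:

```r
library(data.table)

mods <- data.table(
  module     = c("M00001", "M00002"),
  blockcount = c(4L, 3L),
  blockhits  = c(4L, 1L)
)
# completion = completed blocks / total blocks in the module
mods[, completion := blockhits / blockcount]
# Modules with every block completed
complete_mods <- mods[completion == 1]
```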
## HydDB
The HydDB classifier is integrated into the workflow using the online tool. CDS selection is done with an RPSBLAST search. Rules for this step are: `hydDB`, `hydDBRefData` and `hydDBReport`.

The results are placed in the following file: `output/HydDB/BIN/BIN_HydDB.tsv`
- `query`: Short ID of the CDS
- `subject`: RPSBLAST hit
- `identity`: Identity to the pattern
- `allength`: Alignment length
- `mismatches`: Mismatches in the alignment
- `gapopenings`: Gap regions opened
- `querystart`: Start on the CDS
- `queryend`: End on the CDS
- `subjectstart`: Start on the pattern
- `subjectend`: End on the pattern
- `eval`: E-value
- `score`: Bit score
- `center`: Config-defined group of this CDD
- `shortname`: Short human-readable name
- `name`: Description of the CDD
- `subjectlength`: Length of the pattern
- `hitid`: Short ID of the CDS, query start and query end joined with `:`, used during submission
- `Classification`: Classification returned by the online service
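A quick way to summarise this table is counting how often each classification occurs, for example to compare hydrogenase classes across bins. The rows below are made up; only the column names come from the list above:

```r
library(data.table)

# Toy classifier output mimicking the described columns
hyddb <- data.table(
  query          = c("cds_1", "cds_2", "cds_3"),
  hitid          = c("cds_1:10:350", "cds_2:5:300", "cds_3:1:280"),
  Classification = c("[NiFe] Group 1b", "[FeFe] Group A", "[NiFe] Group 1b")
)
# Number of CDS per returned hydrogenase class
counts <- hyddb[, .N, by = Classification]
```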