KnuttBinAnno combines different annotation pipelines into a single one while automating the installation process.
The pipelines are: MetaErg, dbCAN2, EggNOGMapper2, KofamKOALA, InterProScan and HydDB. The HydDB annotation uses the online web service and therefore requires an internet connection. InterProScan tries to use the online lookup service to speed up the annotation, but it can operate without it.
The `all` rule combines the results of the tools given in the config file into a single tsv and GFF file for the predicted CDS. MetaErg output is not included.
- Place your bin files (DNA) in the
- Check the
- Use the following schema for your read files:
- Run the whole workflow with:
```sh
snakemake -kj 32 --use-conda all allReport 2>&1 | tee run.log & disown
```
MetaErg is the first pipeline and is always run, as it needs to run on the contigs directly and executes Prodigal. The vCPU limitation isn't working correctly, as MetaErg starts multiple processes with the given core count at the same time. The rule for execution is `metaerg`; see `metaergRefData` for the installation of the reference data.
```
output
├── output/Benchmarking
│   └── output/Benchmarking/metaerg_BIN.tsv
└── output/MetaErg
    └── output/MetaErg/BIN
        ├── output/MetaErg/BIN/BIN_ddmmyyyy.fna
        ├── output/MetaErg/BIN/data
        │   ├── output/MetaErg/BIN/data/XXSrRNA.ffn
        │   ├── output/MetaErg/BIN/data/all.gff
        │   ├── output/MetaErg/BIN/data/cds.faa
        │   ├── output/MetaErg/BIN/data/cds.ffn
        │   ├── output/MetaErg/BIN/data/crispr.ffn
        │   ├── output/MetaErg/BIN/data/crispr.tab.txt
        │   ├── output/MetaErg/BIN/data/master.gff.txt
        │   ├── output/MetaErg/BIN/data/master.stats.txt
        │   ├── output/MetaErg/BIN/data/master.tbl.txt
        │   ├── output/MetaErg/BIN/data/master.tsv.txt
        │   ├── output/MetaErg/BIN/data/rRNA.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.datatable.json
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.lsu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.ssu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/tRNA.ffn
        │   ├── output/MetaErg/BIN/data/tRNA.tab.txt
        │   └── ...
        ├── output/MetaErg/BIN/index.html
        └── output/MetaErg/BIN/metaerg_run.log
```
See the general file information for the Benchmarking files. Not all files are shown. The `index.html` is not standalone; for example, it uses the `.json` files in the `data` directory.
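If you want to work with the benchmark measurements, the tables can be combined like the other tsv files in this document; a minimal sketch, assuming the standard Snakemake benchmark columns:

```r
# Minimal sketch: collect the MetaErg benchmark tables of all bins.
# Listing the bins from the MetaErg output directory is an assumption;
# the columns (s, max_rss, ...) follow Snakemake's benchmark format.
library(data.table)
bins <- list.files("output/MetaErg/")
bench <- paste0("output/Benchmarking/metaerg_", bins, ".tsv")
bench <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], fread(bench[[i]]))))
```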
To improve the scalability of the workflow on HPC clusters, the CDS fasta files are split into multiple chunks. The chunking can be configured in the config file, and the `split` rule runs the splitting.
Some annotation tools and reports use the KEGG database. Make sure your use case complies with the KEGG license agreement.
At the time of implementation, the official dbCAN2 workflow script was somewhat unstable and not distributable, so it has been reimplemented in this pipeline. You may want to compare the results against the original.
The reference data is created with `dbCANRefData`, the annotation is run with the `dbCAN` rule and a report is available with `dbCANReport`. The results are in the `output/dbCAN/BIN/BIN_dbCAN_all.tsv` file. It is the basis for a GFF3 file; you can use the following R code to extract data from those files:
source("scripts/commonGFF3Tools.R") library(data.table) library(tidyr) bins <- list.files("output/dbCAN/") dbcan <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_all.tsv") dbcan <- rbindlist(lapply(seq_along(bins), function(i)cbind(bin=bins[[i]], readBaseGFF(fread(dbcan[[i]]))))) cazydat <- as.data.table(unnest(as_tibble(dbcan[, .(bin, CAZY)]), "CAZY")) cazydat[,class:=sub("^(\\D+)\\d.*", "\\1", CAZY)] bincount <- cazydat[, .N, by=c("bin", "class", "CAZY")]
A custom version of a CGC finder places proposed gene clusters in `output/dbCAN/BIN/BIN_dbCAN_cgc.tsv`:
```r
# Read the cluster tables of all bins and expand the "|;|"-separated list columns
cgc <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_cgc.tsv")
cgc <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], fread(cgc[[i]]))))
listcols <- c("members", "sources")
cgc[, (listcols) := lapply(.SD, strsplit, split="|;|", fixed=TRUE), .SDcols=listcols]
cgc <- as.data.table(unnest(cgc, all_of(listcols)))
```
The annotation tool of the eggNOG database can be run with:
The output file `output/eggNOG/BIN/BIN_eggNOG.tsv` also uses the GFF base format:
```r
# Combine the eggNOG base GFF tables of all bins
eggnog <- paste0("output/eggNOG/", bins, "/", bins, "_eggNOG.tsv")
eggnog <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], readBaseGFF(fread(eggnog[[i]])))))
```
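A quick check on the combined table, using the objects from above:

```r
# Number of eggNOG annotations per bin
eggnog[, .N, by=bin]
```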
The following rules allow the execution of InterProScan:
If you get the error `libdw.so.1: cannot open shared object file: No such file or directory`, install `elfutils` on your system. If this is not possible, you can also turn off the RPSBLAST search by changing the InterProScan command line call.
`output/InterProScan/BIN/BIN_iprscan.tsv` contains the following columns (a reading sketch follows the list):
- Short ID of the CDS
- MD5 of the CDS aa sequence
- Length of that sequence
- Annotation tool/database
- Accession in the database
- Human readable name of the signature hit
- CDS start
- CDS end
- Score or e-value, depending on the tool
- Always T
- Date of the annotation
- IPR accession
- Human readable name of the InterPro entry
- GO terms (| separated)
- Pathways and the term within them (| separated)
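A minimal sketch for reading these tables; the column names are chosen here for illustration, and a headerless, variable-width layout (as in raw InterProScan output) is assumed:

```r
# Sketch: load the InterProScan tables of all bins.
# Column names are illustrative; header=FALSE and fill=TRUE assume the
# raw InterProScan layout, where trailing columns can be missing.
iprcols <- c("cds", "md5", "length", "tool", "accession", "signature",
             "start", "end", "score", "status", "date", "ipr",
             "ipr_name", "go", "pathways")
ipr <- paste0("output/InterProScan/", bins, "/", bins, "_iprscan.tsv")
ipr <- rbindlist(lapply(seq_along(bins), function(i) {
  tab <- fread(ipr[[i]], header=FALSE, sep="\t", fill=TRUE)
  setnames(tab, iprcols[seq_len(ncol(tab))])
  cbind(bin=bins[[i]], tab)
}), fill=TRUE)
```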
KofamKOALA is available through the following rules:
The filtering of hits in KofamKOALA is based on score thresholds, which are derived from the score distribution of known family members. Some terms are too rare for this calculation. The files with `sure` in their name only contain entries where a threshold was available and exceeded; the files with `all` use the e-value given in the config file to filter hits without a score threshold.
The results are contained in a GFF base file (`output/KofamKOALA/BIN/BIN_KofamKOALA_gffbase.tsv`) and a normal tsv file with the following columns:
- Did this match meet the threshold
- Short ID of the CDS
- K term hit
- Score threshold for this term
- Score of this match
- E-value of this match
- Human readable name of the term
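A reading sketch for the plain tsv files; the file name pattern and the filter column name are assumptions for illustration:

```r
# Sketch: load the plain KofamKOALA tables of all bins.
# The file name pattern and the "sure" column name are assumed.
kofam <- paste0("output/KofamKOALA/", bins, "/", bins, "_KofamKOALA.tsv")
kofam <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], fread(kofam[[i]]))))
# Keep only hits that met their score threshold
sure <- kofam[sure == TRUE]
```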
The `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure.tsv` file can be uploaded to the KEGG BRITE mapping tool. This module search is also performed in the workflow; the results should be similar.
Module search results are stored in files with the following columns:
- Accession of the module
- Human readable name
- Type (Pathway/Signature)
- Category, for example carbohydrate or amino acid metabolism
- All required terms in the module
- Terms found in the bins
- All optional terms
- Optional terms in the bin
- Number of optional blocks
- Optional blocks complete
- Number of blocks in the module
- Completed blocks
- Number of completed blocks divided by the total block count
- CDS contributing to this module
- (Optional) CDS contributing to this module
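Modules above a completeness cutoff can then be selected, for example like this (the file name pattern and the completeness column name are assumptions for illustration):

```r
# Sketch: filter module search results by block completeness.
# The file name pattern and the "completion" column name are assumed.
mods <- paste0("output/KofamKOALA/", bins, "/", bins, "_KofamKOALA_modules.tsv")
mods <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], fread(mods[[i]]))))
complete <- mods[completion >= 0.75]
```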
The HydDB classifier is integrated into the workflow using the online tool. CDS selection is done using an RPSBLAST search. Rules for this step are:
The results are placed in the following file:
- Short ID of the CDS
- RPSBLAST Hit
- Identity to the pattern
- Alignment length
- Mismatches on the alignment
- Gap regions opened
- Start on the CDS
- End on the CDS
- Start on the pattern
- End on the pattern
- Group of this CDD, as defined in the config file
- Short human readable name
- Description of the CDD
- Length of the pattern
- Short ID of the CDS:Query Start:Query End, used during submission
- Classification returned by the online service
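A reading sketch for these results; the file name pattern and the classification column name are assumptions for illustration:

```r
# Sketch: load the HydDB results of all bins and count classes per bin.
# The file name pattern and the "classification" column name are assumed.
hyddb <- paste0("output/HydDB/", bins, "/", bins, "_HydDB.tsv")
hyddb <- rbindlist(lapply(seq_along(bins), function(i)
  cbind(bin=bins[[i]], fread(hyddb[[i]]))))
hyddb[, .N, by=c("bin", "classification")]
```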