# KnuttBinAnno
KnuttBinAnno combines different annotation pipelines into a single one while automating the installation process.

The pipelines are: MetaErg, dbCAN2, EggNOGMapper2, KofamKOALA, InterProScan and HydDB. The HydDB annotation uses the online web service and therefore requires an internet connection. InterProScan tries to use the online lookup service to speed up the annotation, but it can operate without it.
## Quickstart
The `all` rule combines the tools given in the config file into a single TSV and GFF file for the predicted CDS. MetaErg output is not included.
- Place your bin files (DNA) in the `input/` folder.
- Check the `config.yml`.
- Use the following schema for your bin files: `BIN.fa`
- Run the whole workflow with:

  ```sh
  snakemake -kj 32 --use-conda all allReport 2>&1 | tee run.log & disown
  ```
## MetaErg
MetaErg is the first pipeline and is always run, as it needs to run on the contigs directly and executes Prodigal. The vCPU limitation does not work correctly, as MetaErg starts multiple processes with the given core count at the same time. The rule for execution is `metaerg`; `metaergRefData` performs the installation.
### Output files
```
output
├── output/Benchmarking
│   └── output/Benchmarking/metaerg_BIN.tsv
└── output/MetaErg
    └── output/MetaErg/BIN
        ├── output/MetaErg/BIN/BIN_ddmmyyyy.fna
        ├── output/MetaErg/BIN/data
        │   ├── output/MetaErg/BIN/data/XXSrRNA.ffn
        │   ├── output/MetaErg/BIN/data/all.gff
        │   ├── output/MetaErg/BIN/data/cds.faa
        │   ├── output/MetaErg/BIN/data/cds.ffn
        │   ├── output/MetaErg/BIN/data/crispr.ffn
        │   ├── output/MetaErg/BIN/data/crispr.tab.txt
        │   ├── output/MetaErg/BIN/data/master.gff.txt
        │   ├── output/MetaErg/BIN/data/master.stats.txt
        │   ├── output/MetaErg/BIN/data/master.tbl.txt
        │   ├── output/MetaErg/BIN/data/master.tsv.txt
        │   ├── output/MetaErg/BIN/data/rRNA.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.datatable.json
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.lsu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.ssu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/tRNA.ffn
        │   ├── output/MetaErg/BIN/data/tRNA.tab.txt
        │   └── ...
        ├── output/MetaErg/BIN/index.html
        └── output/MetaErg/BIN/metaerg_run.log
```
See the general file information for the `Benchmarking` files. Not all files are shown. The `index.html` is not standalone; for example, it uses the `.json` files in the data directory.
To make the workflow scale better on HPC clusters, the CDS FASTA files are split into multiple chunks. This can be configured in the config file. The `split` rule runs the splitting.
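As a sketch of the chunking idea only (not the workflow's actual implementation; the chunk count and CDS names below are assumptions), sequence identifiers can be assigned round-robin to a fixed number of chunks:

```r
chunk_count <- 4  # assumption: stands in for the chunk setting in the config file
cds_ids <- paste0("cds_", 1:10)
# Round-robin assignment: CDS i goes to chunk (i - 1) %% chunk_count + 1
chunks <- split(cds_ids, rep_len(seq_len(chunk_count), length(cds_ids)))
lengths(chunks)  # chunk sizes: 3 3 2 2
```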
Some annotation tools and reports use the KEGG database. Make sure your use case is within the license agreement.
## dbCAN2
At the time of implementation, the official dbCAN2 workflow script was somewhat unstable and not distributable, so it has been reimplemented in this pipeline. You may want to compare results.
The reference data is created with `dbCANRefData`, the annotation is run with the `dbCAN` rule, and a report is available with `dbCANReport`. The results are in the `output/dbCAN/BIN/BIN_dbCAN_all.tsv` file. It is the base for a GFF3 file; you can use the following R code to extract data from those files:
```r
source("scripts/commonGFF3Tools.R")
library(data.table)
library(tibble)  # for as_tibble
library(tidyr)

bins <- list.files("output/dbCAN/")
dbcan <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_all.tsv")
dbcan <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], readBaseGFF(fread(dbcan[[i]])))))
# One row per CAZy family hit
cazydat <- as.data.table(unnest(as_tibble(dbcan[, .(bin, CAZY)]), "CAZY"))
# Extract the family class (e.g. "GH" from "GH13")
cazydat[, class := sub("^(\\D+)\\d.*", "\\1", CAZY)]
bincount <- cazydat[, .N, by = c("bin", "class", "CAZY")]
```
A custom version of a CGC finder places proposed clusters in `output/dbCAN/BIN/BIN_dbCAN_cgc.tsv`:
```r
cgc <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_cgc.tsv")
cgc <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], fread(cgc[[i]]))))
# members and sources hold "|;|"-separated lists; expand them to one row per element
listcols <- c("members", "sources")
cgc[, (listcols) := lapply(.SD, strsplit, split = "|;|", fixed = TRUE), .SDcols = listcols]
cgc <- as.data.table(unnest(cgc, all_of(listcols)))
```
## eggNOG Mapper
The annotation tool of the eggNOG database can be run with the rules `eggNOGRefData`, `eggNOG` and `eggNOGReport`.

The file `output/eggNOG/BIN/BIN_eggNOG.tsv` also uses the GFF base format:
```r
eggnog <- paste0("output/eggNOG/", bins, "/", bins, "_eggNOG.tsv")
eggnog <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin = bins[[i]], readBaseGFF(fread(eggnog[[i]])))))
```
## InterProScan
The following rules allow the execution of InterProScan: `interProRefData`, `interPro` and `interProReport`.
If you get the error `libdw.so.1: cannot open shared object file: No such file or directory`, install `elfutils` on your system. If this is not possible, you can also turn off the RPSBLAST search by changing the InterProScan command line call.
The file `output/InterProScan/BIN/BIN_iprscan.tsv` contains the following columns:
- `seqid`: Short ID of the CDS
- `md5`: MD5 of the CDS amino acid sequence
- `len`: Length of that sequence
- `analysis`: Annotation tool/database
- `sigacc`: Accession in the database
- `sigdescr`: Human-readable name of the signature hit
- `start`: CDS start
- `stop`: CDS end
- `score`: Score or e-value, depending on the tool
- `status`: Always `T`
- `date`: Date of the annotation
- `interpro`: IPR accession
- `interprodescr`: Human-readable name of the InterPro entry
- `GO`: GO terms (`|`-separated)
- `pathways`: Pathways and terms within (`|`-separated)
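For example, the `|`-separated `GO` column can be expanded to one row per term with `data.table`. The rows below are invented for illustration; only the column names come from the list above:

```r
library(data.table)

# Toy rows mimicking a few of the described columns (values are made up)
ipr <- data.table(
  seqid    = c("cds_1", "cds_1", "cds_2"),
  analysis = c("Pfam", "TIGRFAM", "Pfam"),
  interpro = c("IPR000001", "IPR000002", "IPR000003"),
  GO       = c("GO:0008150|GO:0003674", "GO:0005575", "")
)
# One row per GO term; rows with an empty GO field drop out
golong <- ipr[GO != "", .(GO = unlist(strsplit(GO, "|", fixed = TRUE))),
              by = .(seqid, interpro)]
```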
## KofamKOALA
KofamKOALA is available through the rules `kofamRefData` and `kofam`.
The filtering of hits in KofamKOALA is based on score thresholds derived from the score distribution of known family members. Some terms are too rare for this calculation. The files marked `sure` only contain entries where a threshold was available and exceeded; the `all` files use the e-value given in the config file to filter hits without a score threshold.
The results are contained in a GFF base file (`output/KofamKOALA/BIN/BIN_KofamKOALA_gffbase.tsv`) and a normal TSV file (`output/KofamKOALA/BIN/BIN_KofamKOALA.tsv`).
- `hit`: Did this match meet the threshold
- `seqid`: Short ID of the CDS
- `KO`: K term hit
- `thrshld`: Score threshold for this term
- `score`: Score of this match
- `eval`: E-value of this match
- `descr`: Human-readable name of the term
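The `sure`/`all` distinction described above can be sketched with these columns. The rows are toy values, and the e-value cutoff is an assumption standing in for the config setting:

```r
library(data.table)

kofam <- data.table(
  seqid   = c("cds_1", "cds_2", "cds_3"),
  KO      = c("K00001", "K00002", "K00003"),
  thrshld = c(100.3, NA, 50.0),   # NA: no threshold available for this term
  score   = c(120.5, 80.1, 30.2),
  eval    = c(1e-30, 1e-12, 1e-3)
)
evalue_cutoff <- 1e-10  # assumption: the e-value configured in config.yml

# "sure": a threshold was available and exceeded
sure <- kofam[!is.na(thrshld) & score >= thrshld]
# "all": additionally keep threshold-less hits with a good enough e-value
allhits <- kofam[(!is.na(thrshld) & score >= thrshld) |
                 (is.na(thrshld) & eval <= evalue_cutoff)]
```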
`output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all.tsv` and `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure.tsv` can be uploaded to the KEGG BRITE mapping tool. This module search is also performed in the workflow; results should be similar.

Module search results are stored in the files `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure_modules.tsv` and `output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all_modules.tsv`.
- `module`: Accession of the module
- `name`: Human-readable name
- `Module.Type`: Type (Pathway/Signature)
- `Upper.Class`: For example, carbohydrate or amino acid metabolism
- `Lower.Class`: Subtypes
- `allterms`: All required terms in the module
- `hitterms`: Terms found in the bin
- `alloptionals`: All optional terms
- `hitoptionals`: Optional terms found in the bin
- `optionalblockcount`: Number of optional blocks
- `optionalblockhits`: Completed optional blocks
- `blockcount`: Number of blocks in the module
- `blockhits`: Completed blocks
- `completion`: Number of completed blocks divided by the total block count
- `hits`: CDS contributing to this module
- `optionals`: Optional CDS contributing to this module
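The `completion` value can be reproduced from `blockhits` and `blockcount`. A minimal sketch with invented rows:

```r
library(data.table)

mods <- data.table(
  module     = c("M00001", "M00002"),
  blockcount = c(4L, 3L),
  blockhits  = c(4L, 1L)
)
# completion = completed blocks / total blocks in the module
mods[, completion := blockhits / blockcount]
# Modules with every block completed
complete_mods <- mods[completion == 1]
```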
## HydDB
The HydDB classifier is integrated into the workflow using the online tool. CDS selection is done with an RPSBLAST search. Rules for this step are: `hydDB`, `hydDBRefData` and `hydDBReport`.

The results are placed in the following file: `output/HydDB/BIN/BIN_HydDB.tsv`
- `query`: Short ID of the CDS
- `subject`: RPSBLAST hit
- `identity`: Identity to the pattern
- `allength`: Alignment length
- `mismatches`: Mismatches in the alignment
- `gapopenings`: Gap regions opened
- `querystart`: Start on the CDS
- `queryend`: End on the CDS
- `subjectstart`: Start on the pattern
- `subjectend`: End on the pattern
- `eval`: E-value
- `score`: Bit score
- `center`: Config-defined group of this CDD
- `shortname`: Short human-readable name
- `name`: Description of the CDD
- `subjectlength`: Length of the pattern
- `hitid`: Short ID of the CDS, query start and query end joined with `:`, used during submission
- `Classification`: Classification returned by the online service
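A quick way to summarise this table is counting how often each classification occurs, for example to compare hydrogenase classes across bins. The rows below are made up; only the column names come from the list above:

```r
library(data.table)

# Toy classifier output mimicking the described columns
hyddb <- data.table(
  query          = c("cds_1", "cds_2", "cds_3"),
  hitid          = c("cds_1:10:350", "cds_2:5:300", "cds_3:1:280"),
  Classification = c("[NiFe] Group 1b", "[FeFe] Group A", "[NiFe] Group 1b")
)
# Number of CDS per returned hydrogenase class
counts <- hyddb[, .N, by = Classification]
```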