KnuttBinAnno

KnuttBinAnno combines different annotation pipelines into a single one while automating the installation process.

The pipelines are: MetaErg, dbCAN2, EggNOGMapper2, KofamKOALA, InterProScan and HydDB. The HydDB annotation uses the online web service and therefore requires an internet connection. InterProScan tries to use the online lookup service to speed up the annotation, but it can operate without it.

Quickstart

The all rule combines the tools given in the config file into a single TSV and GFF file for the predicted CDS. MetaErg output is not included.

  1. Place your bin files (DNA) in the input/ folder.
  2. Check the config.yml
  3. Use the following naming schema for your bin files: BIN.fa
  4. Run the whole workflow with:
snakemake -kj 32 --use-conda all allReport 2>&1 | tee run.log & disown

Table of contents

  1. Quickstart
  2. MetaErg
  3. dbCAN2
  4. eggNOG Mapper
  5. InterProScan
  6. KofamKOALA
  7. HydDB

MetaErg

MetaErg is the first pipeline and is always run, as it needs to operate on the contigs directly and executes Prodigal. The vCPU limitation does not work correctly, as MetaErg starts multiple processes with the given core count at the same time. The rule for execution is metaerg; metaergRefData performs the installation.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/metaerg_BIN.tsv
└── output/MetaErg
    └── output/MetaErg/BIN
        ├── output/MetaErg/BIN/BIN_ddmmyyyy.fna
        ├── output/MetaErg/BIN/data
        │   ├── output/MetaErg/BIN/data/XXSrRNA.ffn
        │   ├── output/MetaErg/BIN/data/all.gff
        │   ├── output/MetaErg/BIN/data/cds.faa
        │   ├── output/MetaErg/BIN/data/cds.ffn
        │   ├── output/MetaErg/BIN/data/crispr.ffn
        │   ├── output/MetaErg/BIN/data/crispr.tab.txt
        │   ├── output/MetaErg/BIN/data/master.gff.txt
        │   ├── output/MetaErg/BIN/data/master.stats.txt
        │   ├── output/MetaErg/BIN/data/master.tbl.txt
        │   ├── output/MetaErg/BIN/data/master.tsv.txt
        │   ├── output/MetaErg/BIN/data/rRNA.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.datatable.json
        │   ├── output/MetaErg/BIN/data/taxon.cds.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.lsu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/taxon.ssu.profile.tab.txt
        │   ├── output/MetaErg/BIN/data/tRNA.ffn
        │   ├── output/MetaErg/BIN/data/tRNA.tab.txt
        │   └── ...
        ├── output/MetaErg/BIN/index.html
        └── output/MetaErg/BIN/metaerg_run.log

See the general file information for the Benchmarking files. Not all files are shown. The index.html file is not standalone; for example, it uses the .json files in the data directory.

To allow the workflow to scale on HPC clusters, the CDS FASTA files are split into multiple chunks. The chunking can be configured in the config file. The split rule performs the splitting.

Some annotation tools and reports use the KEGG database. Make sure your use case is within the license agreement.

dbCAN2

At the time of implementation, the official dbCAN2 workflow script was somewhat unstable and not distributable; it has therefore been reimplemented in this pipeline. You may want to compare results.

The reference data is created with dbCANRefData. The annotation is run with the dbCAN rule. A report is available with dbCANReport. The results are in the output/dbCAN/BIN/BIN_dbCAN_all.tsv file. It is the base for a GFF3 file; you can use the following R code to extract data from those files:

source("scripts/commonGFF3Tools.R")
library(data.table)
library(tidyr)
bins <- list.files("output/dbCAN/")

dbcan <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_all.tsv")
dbcan <- rbindlist(lapply(seq_along(bins), function(i)cbind(bin=bins[[i]], readBaseGFF(fread(dbcan[[i]])))))
cazydat <- as.data.table(unnest(as_tibble(dbcan[, .(bin, CAZY)]), "CAZY"))
cazydat[,class:=sub("^(\\D+)\\d.*", "\\1", CAZY)]
bincount <- cazydat[, .N, by=c("bin", "class", "CAZY")]

A custom version of a CGC finder places proposed clusters in output/dbCAN/BIN/BIN_dbCAN_cgc.tsv:

cgc <- paste0("output/dbCAN/", bins, "/", bins, "_dbCAN_cgc.tsv")
cgc <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin=bins[[i]], fread(cgc[[i]]))))
# The members and sources columns hold "|;|"-separated lists; split them
listcols <- c("members", "sources")
cgc[, (listcols):=lapply(.SD, strsplit, split="|;|", fixed=TRUE), .SDcols=listcols]
# One row per cluster member
cgc <- as.data.table(unnest(cgc, all_of(listcols)))

eggNOG Mapper

The annotation tool of the eggNOG database can be run with: eggNOGRefData, eggNOG and eggNOGReport.

The file output/eggNOG/BIN/BIN_eggNOG.tsv also uses the GFF base format:

eggnog <- paste0("output/eggNOG/", bins, "/", bins, "_eggNOG.tsv")
eggnog <- rbindlist(lapply(seq_along(bins), function(i)cbind(bin=bins[[i]], readBaseGFF(fread(eggnog[[i]])))))
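
The resulting table can then be summarized like any data.table, for example to count annotated CDS per bin:

eggnog[, .N, by=bin]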

InterProScan

The following rules allow the execution of InterProScan: interProRefData, interPro and interProReport.

If you get this error: libdw.so.1: cannot open shared object file: No such file or directory, install elfutils on your system. If this is not possible, you can also turn off the RPSBLAST search by changing the InterProScan command line call.

The file output/InterProScan/BIN/BIN_iprscan.tsv contains the following columns:

seqid          Short ID of the CDS
md5            MD5 of the amino acid sequence of the CDS
len            Length of that sequence
analysis       Annotation tool/database
sigacc         Accession in the database
sigdescr       Human readable name of the signature hit
start          CDS start
stop           CDS end
score          Score or e-value, depending on the tool
status         Always T
date           Date of the annotation
interpro       IPR accession
interprodescr  Human readable name of the InterPro entry
GO             GO terms (| separated)
pathways       Pathways and the terms within them (| separated)
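
The file can be read with the same approach as above; a minimal sketch, assuming it has no header row and the columns appear in the listed order:

iprcols <- c("seqid", "md5", "len", "analysis", "sigacc", "sigdescr", "start", "stop",
             "score", "status", "date", "interpro", "interprodescr", "GO", "pathways")
ipr <- paste0("output/InterProScan/", bins, "/", bins, "_iprscan.tsv")
ipr <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin=bins[[i]], fread(ipr[[i]], col.names=iprcols))))
# One row per GO term, dropping CDS without GO annotation
gos <- ipr[!is.na(GO) & GO != "", .(GO=unlist(strsplit(GO, "|", fixed=TRUE))), by=.(bin, seqid)]
gocount <- gos[, .N, by=.(bin, GO)]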

KofamKOALA

KofamKOALA is available through the following rules: kofamRefData and kofam.

The filtering of hits in KofamKOALA is based on score thresholds, which are derived from the score distribution of known family members. Some terms are too rare for this calculation. Files labeled sure only contain entries where a threshold was available and exceeded. Files labeled all use the e-value given in the config file to also keep hits without a score threshold.

The results are contained in a GFF base file (output/KofamKOALA/BIN/BIN_KofamKOALA_gffbase.tsv) and a normal tsv file (output/KofamKOALA/BIN/BIN_KofamKOALA.tsv).

hit      Did this match meet the threshold?
seqid    Short ID of the CDS
KO       K term hit
thrshld  Score threshold for this term
score    Score of this match
eval     E-value of this match
descr    Human readable name of the term
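
As a minimal sketch (assuming the hit column is read as a logical by fread), the sure hits can be counted per bin and K term:

kofam <- paste0("output/KofamKOALA/", bins, "/", bins, "_KofamKOALA.tsv")
kofam <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin=bins[[i]], fread(kofam[[i]]))))
# Keep only matches that met their score threshold
kocount <- kofam[hit == TRUE, .N, by=.(bin, KO)]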

output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all.tsv and output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure.tsv can be uploaded to the KEGG BRITE mapping tool. This module search is also performed in the workflow. Results should be similar.

Module search results are stored in the files output/KofamKOALA/BIN/BIN_KofamKOALA_kos_sure_modules.tsv and output/KofamKOALA/BIN/BIN_KofamKOALA_kos_all_modules.tsv.

module              Accession of the module
name                Human readable name
Module.Type         Type (Pathway/Signature)
Upper.Class         Upper class, e.g. carbohydrate or amino acid metabolism
Lower.Class         Subtypes
allterms            All required terms in the module
hitterms            Terms found in the bin
alloptionals        All optional terms
hitoptionals        Optional terms found in the bin
optionalblockcount  Number of optional blocks
optionalblockhits   Number of completed optional blocks
blockcount          Number of blocks in the module
blockhits           Number of completed blocks
completion          Number of completed blocks divided by the total block count
hits                CDS contributing to this module
optionals           Optional CDS contributing to this module
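
Since completion is blockhits divided by blockcount, fully covered modules can be extracted directly; a minimal sketch:

mods <- paste0("output/KofamKOALA/", bins, "/", bins, "_KofamKOALA_kos_sure_modules.tsv")
mods <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin=bins[[i]], fread(mods[[i]]))))
# Modules with all blocks completed
complete <- mods[completion == 1, .(bin, module, name)]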

HydDB

The HydDB classifier is integrated into the workflow using the online tool. CDS selection is done using an RPSBLAST search. The rules for this step are: hydDB, hydDBRefData and hydDBReport.

The results are placed in the following file: output/HydDB/BIN/BIN_HydDB.tsv

query           Short ID of the CDS
subject         RPSBLAST hit
identity        Identity to the pattern
allength        Alignment length
mismatches      Mismatches in the alignment
gapopenings     Gap regions opened
querystart      Start on the CDS
queryend        End on the CDS
subjectstart    Start on the pattern
subjectend      End on the pattern
eval            E-value
score           Bitscore
center          Config defined group of this CDD
shortname       Short human readable name
name            Description of the CDD
subjectlength   Length of the pattern
hitid           Short ID of the CDS:Query Start:Query End, used during submission
Classification  Classification returned by the online service
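
The classifications can be tallied per bin in the same way as the other result tables; a minimal sketch:

hyddb <- paste0("output/HydDB/", bins, "/", bins, "_HydDB.tsv")
hyddb <- rbindlist(lapply(seq_along(bins), function(i) cbind(bin=bins[[i]], fread(hyddb[[i]]))))
# Hydrogenase classes found per bin
hyddb[, .N, by=.(bin, Classification)]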