Read Annotation

For some applications, running an assembly and binning process might not be necessary. KnuttReads2Bins uses DIAMOND BLASTX searches against protein databases to provide compositional insights into a sample. When using data from this step, keep in mind the limitations of a simple BLAST search for functional annotation.

The HydDB and dbCAN (CAZyDB) protein databases are integrated into the workflow. Users can create custom databases by providing UniProtKB queries. The Krona reports use the taxonomy of the entries in the databases and the BLAST search results, but for dbCAN/CAZy an additional functional Krona is available.

The steps for this stage are defined in the Snakefile_3AnnotateReads file. With two custom databases, the reference data requires around 1 GB, plus 9 GB for the NCBI taxonomy.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

Reference data can be generated with the rule readAnnoRefData. All classification steps can be run with the readAnno, readAnnoKrona and readAnnoReport rules.

Outputs are stored in <outputdir>/ReadAnnotation/, <outputdir>/Data/ReadAnnotation/ and <outputdir>/Reports/.


The files for specific databases can be produced with readAnnoDATABASEDBRefData, readAnnoDATABASE and readAnnoDATABASEKrona. The functional Krona is currently only available for CAZy. DATABASE can be Custom for the UniProtKB-based databases.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadAnnotation
│       └── output/Benchmarking/ReadAnnotation/diamond_run_DATABASE_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadAnnotation
│       └── output/Data/ReadAnnotation/SAMPLE_readanno_DATABASE.tsv
├── output/ReadAnnotation
│   └── output/ReadAnnotation/SAMPLE
│       ├── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.log
│       └── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.tsv
└── output/Reports
    ├── output/Reports/6readanno_DATABASE.html
    ├── output/Reports/readanno_CAZyDB_funkrona.html
    └── output/Reports/readanno_DATABASE_krona.html
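
The naming pattern in the tree above can be sketched as a small helper that fills in the SAMPLE and DATABASE placeholders. This is only an illustration: the function name `read_annotation_outputs`, the dictionary keys and the sample name `S1` are hypothetical and not part of the workflow.

```python
from pathlib import PurePosixPath


def read_annotation_outputs(outputdir, sample, database):
    """Build the per-sample, per-database paths shown in the output tree.

    Hypothetical helper: it only substitutes SAMPLE and DATABASE into the
    documented file names and does not check that the files exist.
    """
    out = PurePosixPath(outputdir)
    return {
        "benchmark": out / "Benchmarking" / "ReadAnnotation"
        / f"diamond_run_{database}_{sample}.tsv",
        "best_hits": out / "Data" / "ReadAnnotation"
        / f"{sample}_readanno_{database}.tsv",
        "blast_log": out / "ReadAnnotation" / sample
        / f"{sample}_prot_{database}_blast.log",
        "blast_tsv": out / "ReadAnnotation" / sample
        / f"{sample}_prot_{database}_blast.tsv",
        "report": out / "Reports" / f"6readanno_{database}.html",
        "krona": out / "Reports" / f"readanno_{database}_krona.html",
    }


paths = read_annotation_outputs("output", "S1", "HydDB")
for name, path in paths.items():
    print(name, path)
```

The functional Krona report (readanno_CAZyDB_funkrona.html) is left out of the helper because it exists only for CAZyDB.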

Current database options are:

CAZyDB
dbCAN2 project based database
HydDB
Sequences in the hydrogenase database project
Custom
Produce files for all custom databases

See the general file information for the Benchmarking files. The SAMPLE_readanno_DATABASE.tsv file contains the BLAST hits with the lowest e-value for each query.

qname
Full id of the read
sname
Full id of the subject
pident
Percentage of identical positions in the alignment
length
Alignment length
mismatch
Number of mismatches
gapopen
Number of gap openings
qstart
Start in the read
qend
End in the read
sstart
Start on the subject
send
End on the subject
evalue
E value
bitscore
Bitscore
staxids
All subject tax ids
qlen
Full length of the read
slen
Full length of the subject

After these columns the data from the database follows.
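
As a minimal sketch of the "lowest e-value per query" selection, the following keeps one hit per read from tab-separated rows that use the fifteen BLAST columns listed above. It assumes rows without a header line and in exactly that column order, and it keeps the first hit on ties; how the workflow itself performs this step and resolves ties is not documented here. The function name `best_hits` and the demo rows are made up for illustration.

```python
import csv
import io

# Column order as documented for the BLAST part of the table.
COLUMNS = [
    "qname", "sname", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "staxids", "qlen", "slen",
]


def best_hits(handle):
    """Return {read id: hit} with the lowest e-value per read.

    Assumption: header-less, tab-separated rows; the first hit wins ties.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        row = dict(zip(COLUMNS, fields))
        current = best.get(row["qname"])
        if current is None or float(row["evalue"]) < float(current["evalue"]):
            best[row["qname"]] = row
    return best


# Invented example rows: two hits for read1, one for read2.
demo = io.StringIO(
    "read1\tprotA\t90.0\t50\t5\t0\t1\t150\t10\t59\t1e-20\t95.1\t562\t150\t300\n"
    "read1\tprotB\t80.0\t50\t10\t0\t1\t150\t10\t59\t1e-10\t60.2\t562\t150\t280\n"
    "read2\tprotB\t85.0\t40\t6\t0\t4\t123\t20\t59\t1e-15\t70.0\t1423\t150\t280\n"
)
hits = best_hits(demo)
print({read: hit["sname"] for read, hit in hits.items()})
```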


CAZyDB

taxid
Tax id
CAZyECs
Space-separated list of CAZy classes
Tax columns
Taxonomy of the entry

HydDB

taxid
Tax id
Date1
Added to the DB (?)
Date2
Updated in the DB (?)
HydDB_species
Species in HydDB
HydrogenaseClass
HydDB class
Sequence
Amino acid sequence
nt_sequence1
Not always set
nt_sequence2
Not always set
HydDB_phylum
HydDB phylum
HydDB_order
HydDB order
PredictedActivity
Type of hydrogenase activity
PredictedOxyTolerance
Type of oxygen tolerance
PredictedSubunitsNumber
Number of subunits
PredictedMetalCentres
Metal center
PredictedSubunits
Subunit types
Tax columns
(Current) Taxonomy of the entry

Custom

Entry
Entry ID
Entry name
Entry name
Status
reviewed or unreviewed
Protein names
Human readable name
Gene names
Genes encoding this protein
Organism
Human readable organism name
Cross-reference (CAZy)
Annotation in the CAZy project
Pathway
UniProt Function: Pathway
EC number
UniProt EC numbers
Organism ID
Tax id of the organism
Taxonomic lineage IDs
Should be the same as Organism ID
Sequence
Full amino acid sequence
Cross-reference (KO)
KEGG Orthology (KO) terms, only for entries in KEGG
Tax columns
Taxonomy of the entry (based on the Organism ID)

Defining custom databases

Additional databases can be added by placing queries to the UniProtKB database in data/readanno_DATABASE.tsv.

The first row is treated as a header and ignored. The second column holds the query terms, which are chained with OR.

group	query
Formate dehydrogenase	ec:"1.17.1.9"
Formate dehydrogenase (NADP(+))	ec:"1.17.1.10"
Formate dehydrogenase (NAD(+), ferredoxin)	ec:"1.17.1.11"
Formate dehydrogenase-N	ec:"1.17.5.3" 
Formate dehydrogenase (coenzyme F420)	ec:"1.17.98.3" OR database:(type:ko k22516) OR database:(type:ko k22516) OR database:(type:ko k00125)
Formate dehydrogenase (acceptor)	ec:"1.17.99.7"
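
The rule described above (first row ignored, second column chained with OR) can be sketched as follows. This is an illustration of the file format only, not the workflow's actual implementation; the function name `combined_query` and the shortened demo table are hypothetical.

```python
def combined_query(lines):
    """Join the second column of a tab-separated query table with OR.

    Sketch under assumptions: the first line is a header, and lines
    without a second column or with an empty query are skipped.
    """
    terms = []
    for line in list(lines)[1:]:  # skip the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            terms.append(fields[1].strip())
    return " OR ".join(terms)


# Shortened version of the example table above.
demo = [
    "group\tquery\n",
    'Formate dehydrogenase\tec:"1.17.1.9"\n',
    'Formate dehydrogenase-N\tec:"1.17.5.3" \n',
]
print(combined_query(demo))
```

For a real file, the same function can be given an open file handle, for example `combined_query(open("data/readanno_DATABASE.tsv"))` with DATABASE replaced by the database name.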