Read annotation

For some applications, running an assembly and binning process might not be necessary. KnuttReads2Bins uses DIAMOND BLASTX searches against protein databases to provide compositional insights into a sample. When using data from this step, keep in mind the limitations of a simple BLAST search for functional annotation.

The HydDB and dbCAN (CAZyDB) protein databases are integrated into the workflow. Users can create custom databases by providing UniProtKB queries. The Krona reports use the taxonomy of the entries in the databases and the BLAST search results. For dbCAN/CAZy, an additional functional Krona report is available.

The steps for this stage are defined in the Snakefile_3AnnotateReads file. With two custom databases, the reference data requires around 1 GB, plus around 9 GB for the NCBI taxonomy.

Small metagenome: Tbd. GB
Medium metagenome: Tbd. GB
Large metagenome: Tbd. GB

Reference data can be generated with the rule readAnnoRefData. All classification steps can be run with the readAnno, readAnnoKrona and readAnnoReport rules.

Outputs are stored in <outputdir>/ReadAnnotation/, <outputdir>/Data/ReadAnnotation/ and <outputdir>/Reports/.

The files for a specific database can be produced with readAnnoDATABASEDBRefData, readAnnoDATABASE and readAnnoDATABASEKrona, where DATABASE stands for the database name. The functional Krona is currently only available for CAZy. Use Custom as DATABASE for the UniProtKB-based databases.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadAnnotation
│       └── output/Benchmarking/ReadAnnotation/diamond_run_DATABASE_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadAnnotation
│       └── output/Data/ReadAnnotation/SAMPLE_readanno_DATABASE.tsv
├── output/ReadAnnotation
│   └── output/ReadAnnotation/SAMPLE
│       ├── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.log
│       └── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.tsv
└── output/Reports
    ├── output/Reports/6readanno_DATABASE.html
    ├── output/Reports/readanno_CAZyDB_funkrona.html
    └── output/Reports/readanno_DATABASE_krona.html
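
The layout above can be written down as a small helper that lists the files to expect for one sample and one database. This is only a sketch: the function name `expected_outputs` and its parameters are made up for illustration and are not part of the workflow. The file name patterns are copied from the tree above.

```python
from pathlib import Path


def expected_outputs(outputdir, sample, database):
    """Return the per-sample, per-database paths shown in the output tree.

    Hypothetical helper for illustration; the workflow does not ship it.
    """
    out = Path(outputdir)
    return {
        "benchmark": out / "Benchmarking" / "ReadAnnotation" / f"diamond_run_{database}_{sample}.tsv",
        "data": out / "Data" / "ReadAnnotation" / f"{sample}_readanno_{database}.tsv",
        "blast_log": out / "ReadAnnotation" / sample / f"{sample}_prot_{database}_blast.log",
        "blast_tsv": out / "ReadAnnotation" / sample / f"{sample}_prot_{database}_blast.tsv",
        "report": out / "Reports" / f"6readanno_{database}.html",
        "krona": out / "Reports" / f"readanno_{database}_krona.html",
    }


# "S1" is a placeholder sample name.
for name, path in expected_outputs("output", "S1", "HydDB").items():
    print(name, path.as_posix())
```

Checking these paths with `Path.exists()` after a run is one quick way to see which steps have finished.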

Current database options are:

CAZyDB: dbCAN2 project based database
HydDB: Sequences in the hydrogenase database project
Custom: Produce files for all custom databases

See the general file information for the Benchmarking files. The SAMPLE_readanno_DATABASE.tsv file contains the BLAST hits with the lowest e-value for each query.

qname: Full id of the read
sname: Full id of the subject
pident: Percentage of identical positions in the alignment
length: Alignment length
mismatch: Number of mismatches
gapopen: Number of gap openings
qstart: Start in the read
qend: End in the read
sstart: Start on the subject
send: End on the subject
evalue: E value
bitscore: Bitscore
staxids: All subject tax ids
qlen: Full length of the read
slen: Full length of the subject
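
To illustrate what "lowest e-value for each query" means, the following sketch reduces a list of hits to one per read. It is not the workflow's own code: `best_hits` and the tiny in-memory table are invented for the example, and it assumes a tab-separated table with a header row. Only the column names `qname`, `sname` and `evalue` come from the list above.

```python
import csv
import io


def best_hits(rows):
    """Keep, for every qname, the row with the smallest evalue."""
    best = {}
    for row in rows:
        current = best.get(row["qname"])
        if current is None or float(row["evalue"]) < float(current["evalue"]):
            best[row["qname"]] = row
    return best


# Made-up hits for two reads; real files carry all columns listed above.
example = io.StringIO(
    "qname\tsname\tevalue\n"
    "read1\tprotA\t1e-5\n"
    "read1\tprotB\t1e-20\n"
    "read2\tprotC\t3e-8\n"
)
hits = best_hits(csv.DictReader(example, delimiter="\t"))
for qname, row in hits.items():
    print(qname, row["sname"], row["evalue"])
```

On ties, this sketch keeps the first hit it sees. How the workflow itself breaks ties is not stated here.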

After these columns the data from the database follows.

CAZyDB

taxid: Tax id
CAZyECs: Space-separated list of CAZy classes
Tax columns: Taxonomy of the entry
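
Because CAZyECs holds several classes in one field, it usually has to be split before counting. A minimal sketch, assuming the field is split on whitespace. The function name `count_cazy_classes` is hypothetical, and the example values are placeholders rather than entries from a real result file.

```python
from collections import Counter


def count_cazy_classes(cazy_ecs_values):
    """Count how often each entry occurs across CAZyECs fields."""
    counts = Counter()
    for value in cazy_ecs_values:
        # str.split() without arguments splits on whitespace and
        # returns an empty list for empty fields.
        counts.update(value.split())
    return counts


# Illustrative values only.
counts = count_cazy_classes(["GH13 CBM48", "GH13", ""])
print(counts.most_common())
```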

HydDB

taxid: Tax id
Date1: Added to the DB (?)
Date2: Updated to the DB (?)
HydDB_species: Species in HydDB
HydrogenaseClass: HydDB class
Sequence: Amino acid sequence
nt_sequence1: Not always set
nt_sequence2: Not always set
HydDB_phylum: HydDB phylum
HydDB_order: HydDB order
PredictedActivity: Type of hydrogenase activity
PredictedOxyTolerance: Type of oxygen tolerance
PredictedSubunitsNumber: Number of subunits
PredictedMetalCentres: Ion center
PredictedSubunits: Subunit types
Tax columns: (Current) Taxonomy of the entry

Custom

Entry: Entry ID
Entry name: Entry name
Status: reviewed or unreviewed
Protein names: Human readable name
Gene names: Genes encoding this protein
Organism: Human readable organism name
Cross-reference (CAZy): Annotation in the CAZy project
Pathway: UniProt Function: Pathway
EC number: UniProt EC numbers
Organism ID: Tax id of the organism
Taxonomic lineage IDs: Should be the same as Organism ID
Sequence: Full amino acid sequence
Cross-reference (KO): KEGG Orthology (KO) terms, only for entries in KEGG
Tax columns: Taxonomy of the entry (based on the Organism ID)

Defining custom databases

Additional databases can be added by placing queries to the UniProtKB database in data/readanno_DATABASE.tsv.

The first row is ignored. The second column of each remaining row is used as a search term, and the terms are chained with OR.

group	query
Formate dehydrogenase	ec:"1.17.1.9"
Formate dehydrogenase (NADP(+))	ec:"1.17.1.10"
Formate dehydrogenase (NAD(+), ferredoxin)	ec:"1.17.1.11"
Formate dehydrogenase-N	ec:"1.17.5.3"
Formate dehydrogenase (coenzyme F420)	ec:"1.17.98.3" OR database:(type:ko k22516) OR database:(type:ko k22516) OR database:(type:ko k00125)
Formate dehydrogenase (acceptor)	ec:"1.17.99.7"
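
The rule for combining the rows can be sketched as follows. This is an illustration of the description above, not the workflow's actual implementation: `build_query` is a hypothetical name, and the sketch assumes the file is tab-separated with the query in the second column.

```python
import csv
import io


def build_query(tsv_text):
    """Join the second column of every row after the first with OR."""
    # QUOTE_NONE keeps the double quotes inside the queries as plain text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # the first row is ignored
    terms = [row[1].strip() for row in reader if len(row) > 1 and row[1].strip()]
    return " OR ".join(terms)


example = (
    "group\tquery\n"
    'Formate dehydrogenase\tec:"1.17.1.9"\n'
    'Formate dehydrogenase (NADP(+))\tec:"1.17.1.10"\n'
)
print(build_query(example))
```

Rows without a second column are skipped in this sketch, so a line that uses a space instead of a tab between group and query contributes nothing to the result.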