Read Annotation
For some applications, running an assembly and binning process might not be necessary. KnuttReads2Bins uses DIAMOND BLASTX searches against protein databases to provide compositional insights into a sample. The limitations of a simple BLAST search for functional annotation should be kept in mind when using data from this step.
The HydDB and dbCAN (CAZyDB) protein databases are integrated into the workflow. Users can create custom databases by providing UniProtKB queries. The Krona reports use the taxonomy of the entries in the databases and the BLAST search results, but for dbCAN/CAZy an additional functional Krona is available.
The steps for this stage are defined in the Snakefile_3AnnotateReads file. With two custom databases, the reference data requires around 1 GB, plus 9 GB for the NCBI taxonomy.
- Small metagenome
- Tbd. GB
- Medium metagenome
- Tbd. GB
- Large metagenome
- Tbd. GB
Reference data can be generated with the rule readAnnoRefData. All classification steps can be run with the readAnno, readAnnoKrona and readAnnoReport rules.
Outputs are stored in <outputdir>/ReadAnnotation/, <outputdir>/Data/ReadAnnotation/ and <outputdir>/Reports/.
The files for specific databases can be produced with readAnnoDATABASEDBRefData, readAnnoDATABASE and readAnnoDATABASEKrona. The functional Krona is currently only available for CAZy. DATABASE can be Custom for the UniProtKB-based databases.
Output files
output
├── output/Benchmarking
│   └── output/Benchmarking/ReadAnnotation
│       └── output/Benchmarking/ReadAnnotation/diamond_run_DATABASE_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadAnnotation
│       └── output/Data/ReadAnnotation/SAMPLE_readanno_DATABASE.tsv
├── output/ReadAnnotation
│   └── output/ReadAnnotation/SAMPLE
│       ├── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.log
│       └── output/ReadAnnotation/SAMPLE/SAMPLE_prot_DATABASE_blast.tsv
└── output/Reports
    ├── output/Reports/6readanno_DATABASE.html
    ├── output/Reports/readanno_CAZyDB_funkrona.html
    └── output/Reports/readanno_DATABASE_krona.html
Current database options are:
- CAZyDB
- dbCAN2 project based database
- HydDB
- Sequences in the hydrogenase database project
- Custom
- Produce files for all custom databases
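For scripting against the results, the file locations in the tree above can be derived from a sample and database name. The following is only a sketch: the function name readanno_paths and the sample name S1 are hypothetical, and it assumes that the DATABASE placeholder is replaced literally by the database name (for example HydDB), which may not hold for the Custom option.

```python
from pathlib import PurePosixPath


def readanno_paths(outputdir, sample, database):
    """Build the read annotation file paths following the output tree.

    Assumption: SAMPLE and DATABASE in the documented file names are
    replaced literally by the given sample and database names.
    """
    out = PurePosixPath(outputdir)
    sample_dir = out / "ReadAnnotation" / sample
    return {
        "benchmark": out / "Benchmarking" / "ReadAnnotation"
        / f"diamond_run_{database}_{sample}.tsv",
        "table": out / "Data" / "ReadAnnotation"
        / f"{sample}_readanno_{database}.tsv",
        "blast_log": sample_dir / f"{sample}_prot_{database}_blast.log",
        "blast_tsv": sample_dir / f"{sample}_prot_{database}_blast.tsv",
        "report": out / "Reports" / f"6readanno_{database}.html",
        "krona": out / "Reports" / f"readanno_{database}_krona.html",
    }


# Hypothetical sample name "S1" with the HydDB database.
paths = readanno_paths("output", "S1", "HydDB")
for name, path in paths.items():
    print(name, path)
```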
See the general file information for the Benchmarking files. The SAMPLE_readanno_DATABASE.tsv file contains the BLAST hits with the lowest e-value for each query.
- qname
- Full id of the read
- sname
- Full id of the subject
- pident
- Percentage of identical matches in the alignment
- length
- Alignment length
- mismatch
- Number of mismatches
- gapopen
- Number of gap openings
- qstart
- Start in the read
- qend
- End in the read
- sstart
- Start on the subject
- send
- End on the subject
- evalue
- E value
- bitscore
- Bitscore
- staxids
- All subject tax ids
- qlen
- Full length of the read
- slen
- Full length of the subject
After these columns the data from the database follows.
CAZyDB
- taxid
- Tax id
- CAZyECs
- Space separated list of CAZy classes
- Tax columns
- Taxonomy of the entry
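Because CAZyECs holds several classes in one space separated field, the field has to be split before classes can be counted. A minimal sketch, assuming the column values have already been read as strings; the function name count_cazy_classes and the example values are made up for illustration and are not taken from the workflow:

```python
from collections import Counter


def count_cazy_classes(cazy_ecs_values):
    """Count how often each CAZy class occurs across CAZyECs fields."""
    counts = Counter()
    for value in cazy_ecs_values:
        # Each field is a space separated list, so split on whitespace.
        counts.update(value.split())
    return counts


# Made-up CAZyECs values for three hits (illustrative only).
example_values = ["GH13 CBM48", "GH13", "GT2"]
class_counts = count_cazy_classes(example_values)
print(class_counts.most_common())
```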
HydDB
- taxid
- Tax id
- Date1
- Added to the DB (?)
- Date2
- Updated to the DB (?)
- HydDB_species
- Species in HydDB
- HydrogenaseClass
- HydDB class
- Sequence
- Amino acid sequence
- nt_sequence1
- Not always set
- nt_sequence2
- Not always set
- HydDB_phylum
- HydDB phylum
- HydDB_order
- HydDB order
- PredictedActivity
- Type of hydrogenase activity
- PredictedOxyTolerance
- Type of oxygen tolerance
- PredictedSubunitsNumber
- Number of subunits
- PredictedMetalCentres
- Ion center
- PredictedSubunits
- Subunit types
- Tax columns
- (Current) Taxonomy of the entry
Custom
- Entry
- Entry ID
- Entry name
- Entry name
- Status
- reviewed or unreviewed
- Protein names
- Human readable name
- Gene names
- Genes encoding this protein
- Organism
- Human readable organism name
- Cross-reference (CAZy)
- Annotation in the CAZy project
- Pathway
- UniProt Function: Pathway
- EC number
- UniProt EC numbers
- Organism ID
- Tax id of the organism
- Taxonomic lineage IDs
- Should be the same as Organism ID
- Sequence
- Full amino acid sequence
- Cross-reference (KO)
- KEGG Orthology (KO) terms, only for entries in KEGG
- Tax columns
- Taxonomy of the entry (based on the Organism ID)
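The selection of the hit with the lowest e-value for each query, as stored in SAMPLE_readanno_DATABASE.tsv, can be sketched as follows. This is not the workflow's own implementation. It assumes a tab separated BLAST result without a header line whose columns follow the order of the BLAST column list; the function name best_hits and all example rows are hypothetical.

```python
import csv
import io

# Assumed column order, following the BLAST column list of the table.
BLAST_COLUMNS = [
    "qname", "sname", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "staxids", "qlen", "slen",
]


def best_hits(handle):
    """Keep, for every read (qname), the hit with the lowest e-value."""
    best = {}
    reader = csv.DictReader(
        handle, fieldnames=BLAST_COLUMNS, delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for row in reader:
        evalue = float(row["evalue"])
        current = best.get(row["qname"])
        # On ties, the hit that was seen first is kept.
        if current is None or evalue < float(current["evalue"]):
            best[row["qname"]] = row
    return best


# Made-up example rows; read and subject ids are hypothetical.
example = io.StringIO(
    "read1\tprotA\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60.0\t562\t150\t300\n"
    "read1\tprotB\t95.0\t50\t2\t0\t1\t150\t1\t50\t1e-20\t90.0\t562\t150\t310\n"
    "read2\tprotA\t70.0\t40\t12\t0\t1\t120\t5\t44\t1e-5\t40.0\t562\t120\t300\n"
)
hits = best_hits(example)
for qname, row in hits.items():
    print(qname, row["sname"], row["evalue"])
```

Comparing e-values as floats avoids the wrong ordering that a plain string comparison of values such as 1e-5 and 1e-20 would give.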
Defining custom databases
Additional databases can be added by placing queries to the UniProtKB database in data/readanno_DATABASE.tsv. The first row is ignored, and the second column provides the query terms, which are chained with OR.
group query
Formate dehydrogenase ec:"1.17.1.9"
Formate dehydrogenase (NADP(+)) ec:"1.17.1.10"
Formate dehydrogenase (NAD(+), ferredoxin) ec:"1.17.1.11"
Formate dehydrogenase-N ec:"1.17.5.3"
Formate dehydrogenase (coenzyme F420) ec:"1.17.98.3" OR database:(type:ko k22516) OR database:(type:ko k22516) OR database:(type:ko k00125)
Formate dehydrogenase (acceptor) ec:"1.17.99.7"
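How such a file turns into a single UniProtKB query can be sketched as follows. This is an illustration under assumptions, not the workflow's code: the columns are assumed to be tab separated (as the .tsv extension suggests), the function name build_query is hypothetical, and wrapping each term in parentheses is an added precaution so that rows which already contain OR stay grouped.

```python
def build_query(lines):
    """Chain the second column of a query file with OR.

    The first row (header) is ignored, as described above.
    Assumption: columns are separated by tabs.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]
    terms = [
        cols[1].strip()
        for cols in rows[1:]
        if len(cols) > 1 and cols[1].strip()
    ]
    # Parentheses are an assumption of this sketch, not a documented rule.
    return " OR ".join(f"({term})" for term in terms)


# Two rows taken from the example file above, written with tab separators.
example_lines = [
    "group\tquery\n",
    'Formate dehydrogenase\tec:"1.17.1.9"\n',
    'Formate dehydrogenase-N\tec:"1.17.5.3"\n',
]
print(build_query(example_lines))
```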