Assembly
Reads are assembled with MEGAHIT. Only adapter trimming (if enabled) is applied to the reads used for assembly; no quality trimming is performed, as recommended by the authors of MEGAHIT. MEGAHIT is run with the output of BBMerge, including the unmerged reads. For read mapping, the BBMerge input reads are used. The whole assembly can be classified with Sourmash, and each contig can be classified with CAT. MetaQUAST is run on each assembly.
MetaQUAST reference download is turned off by default, because parallel downloads of the same reference for two different samples result in invalid files and crashes.
The steps for this stage are defined in the Snakefile_4Assembly file. Sourmash uses the same reference data (GenomeDB based) for contig classification and read classification. CAT classification is based on the NCBI nr database.
- Small metagenome
- Tbd. GB
- Medium metagenome
- Tbd. GB
- Large metagenome
- Tbd. GB
Assembly and read mapping data can be generated with assembly, and the report is produced by assemblyReport. A Sourmash description of the whole assembly is available with assemblySourmash, and single contig classification with CAT can be executed with cat. Krona reports for the contigs are available with catKrona and assemblySourmashKrona. Reference data for CAT (and BAT) can be generated with the catRefData rule. The CAT contig data is reused for running BAT on the bins later. Reference data (excluding CAT/BAT) is the input for assemblyRefData.
Outputs are stored in <outputdir>/Assembly/, <outputdir>/Data/Assembly/ and <outputdir>/Reports/.
Output files
output
├── output/Assembly
│   ├── output/Assembly/CAT
│   │   └── output/Assembly/CAT/SAMPLE
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.alignment.diamond
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE_CAT.log
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.named.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.ORF2LCA.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.faa
│   │       └── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.gff
│   ├── output/Assembly/MEGAHIT
│   │   ├── output/Assembly/MEGAHIT/SAMPLE
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/checkpoints.txt
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/done
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/final.contigs.fa
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/intermediate_contigs
│   │   │   │   ├── ...
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/log
│   │   │   └── output/Assembly/MEGAHIT/SAMPLE/options.json
│   │   └── output/Assembly/MEGAHIT/SAMPLE_megahit.log
│   ├── output/Assembly/MetaQUAST
│   │   ├── output/Assembly/MetaQUAST/SAMPLE
│   │   │   └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast
│   │   │       ├── ...
│   │   │       ├── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/report.pdf
│   │   │       └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/transposed_report.txt
│   │   └── output/Assembly/MetaQUAST/SAMPLE_metaquast.log
│   ├── output/Assembly/Readmapping
│   │   └── output/Assembly/Readmapping/SAMPLE
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.tsv
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_cov.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_merged.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R1.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R2.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam.bai
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.log
│   │       └── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.sam
│   └── output/Assembly/sourmash
│       └── output/Assembly/sourmash/SAMPLE
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.log
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.sig
│           └── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_lca.log
├── output/Benchmarking
│   ├── output/Benchmarking/Assembly
│   │   ├── output/Benchmarking/Assembly/assembly_report.tsv
│   │   ├── output/Benchmarking/Assembly/cat_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/MEGAHIT_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/metaquast_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/pileup_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/read_mapping_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/sourmash_contigs_k51_SAMPLE.tsv
│   │   └── output/Benchmarking/Assembly/sourmash_contigs_lca_SAMPLE.tsv
├── output/Data
│   └── output/Data/Assembly
│       ├── output/Data/Assembly/SAMPLE_contigs_cat.tsv
│       ├── output/Data/Assembly/SAMPLE_contigs_sourmash_lca.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_jgi_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_summary.tsv
│       └── output/Data/Assembly/SAMPLE_metaquast.tsv
└── output/Reports
    ├── output/Reports/7assembly.html
    ├── output/Reports/CAT_krona.html
    ├── output/Reports/MetaQUAST_SAMPLE.html
    └── output/Reports/sourmash_contig_krona.html
See the general file information for the Benchmarking files.
MEGAHIT adds “fields” to every contig description:
- flag
- Connectivity, 1: Standalone, 2: Looped Path, 0: Other
- multi
- Very rough coverage
- len
- Length
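The three fields can be pulled out of a contig description with a few lines of Python. This is a minimal sketch, assuming the description looks like `k141_12 flag=1 multi=3.0000 len=412` (space-separated `key=value` pairs after the contig name); the function name `parse_megahit_header` and the `CONNECTIVITY` table are illustrative and not part of the pipeline.

```python
import re

# Assumed header layout: "<name> flag=<int> multi=<float> len=<int>"
_FIELD = re.compile(r"(\w+)=(\S+)")

# Meaning of the flag values as listed above.
CONNECTIVITY = {0: "other", 1: "standalone", 2: "looped path"}


def parse_megahit_header(header):
    """Split a MEGAHIT contig description into its name and fields."""
    header = header.lstrip(">").strip()
    name, _, rest = header.partition(" ")
    fields = dict(_FIELD.findall(rest))
    flag = int(fields["flag"])
    return {
        "contig": name,
        "flag": flag,
        "connectivity": CONNECTIVITY.get(flag, "unknown"),
        "multi": float(fields["multi"]),  # very rough coverage
        "len": int(fields["len"]),
    }


# Hypothetical header, only to show the call.
print(parse_megahit_header(">k141_12 flag=1 multi=3.0000 len=412"))
```

Because the output tables store the full contig name, the same function might also be applied to the contig column of those files, provided the name keeps this layout.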
Contig coverage/depth
Two different coverage programs are applied to the mapped reads. SAMPLE_cov_jgi_details.tsv is produced with the MetaBAT2 tool jgi_summarize_bam_contig_depths and SAMPLE_cov_pileup_details.tsv with pileup.sh from BBMap.
SAMPLE_cov_pileup_summary.tsv has a single row for the sample with the following columns:
- reads
- Number of read pairs in the sample
- mappedreads
- Number of read pairs that mapped to the assembly
- mappedbp
- Number of base pairs that mapped
- contigs
- Number of contigs in the assembly
- contigbp
- Number of base pairs in the assembly
- properpairsperc
- Percentage of mapped read pairs that were mapped together (as a proper pair)
- avgcov
- Average coverage
- stddev
- Standard deviation of the coverage
- contigswithanycovperc
- Percentage of contigs which had any coverage
- bpswithanycovperc
- Percentage of base pairs which had any coverage
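As an example of using the summary table, the fraction of read pairs that mapped back to the assembly can be derived from the reads and mappedreads columns. The sketch below assumes the file is tab-separated with a header row that uses the column names listed above; `mapping_rate` is a hypothetical helper, not a pipeline function.

```python
import csv
import io


def mapping_rate(handle):
    """Return mappedreads / reads for the single data row of the summary."""
    # Assumption: tab-separated text with a header row naming the columns.
    row = next(csv.DictReader(handle, delimiter="\t"))
    reads = float(row["reads"])
    mapped = float(row["mappedreads"])
    return mapped / reads if reads else 0.0


# Made-up numbers, only to show the call; a real file has all ten columns.
example = io.StringIO("reads\tmappedreads\n1000\t750\n")
print(mapping_rate(example))
```

For a real sample the handle would come from `open("<outputdir>/Data/Assembly/SAMPLE_cov_pileup_summary.tsv")` instead of the in-memory example.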
In contrast, SAMPLE_cov_pileup_details.tsv contains one row for every contig:
- contig
- Full name of the contig
- avgcov
- Average coverage
- len
- Contig length
- refgc
- Contig GC content
- covperc
- Percentage of bases covered
- plus_reads
- Number of reads that mapped on the + sequence
- minus_reads
- Number of reads that mapped on the reverse complemented sequence
- readgc
- Average read GC content
- mediancov
- Median coverage
- stddev
- Standard deviation of the coverage
- covbases
- Bases with coverage
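The per-contig table lends itself to simple filtering, for example to keep only contigs that are both deeply and broadly covered. The following is a sketch under assumptions: the file is tab-separated with a header row using the column names above, and the thresholds (`min_avgcov`, `min_covperc`) are arbitrary illustration values rather than recommendations.

```python
import csv
import io


def well_covered_contigs(handle, min_avgcov=5.0, min_covperc=90.0):
    """Return names of contigs that pass both (hypothetical) thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["contig"]
        for row in reader
        if float(row["avgcov"]) >= min_avgcov
        and float(row["covperc"]) >= min_covperc
    ]


# Made-up rows with only the columns used here.
example = io.StringIO(
    "contig\tavgcov\tcovperc\n"
    "contig_a\t12.5\t99.0\n"
    "contig_b\t1.2\t40.0\n"
)
print(well_covered_contigs(example))
```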
SAMPLE_cov_jgi_details.tsv contains only 4 columns:
- contig
- Full name of the contig
- len
- Contig length
- avgcov
- Average coverage
- var
- Coverage variance
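Since this file reports the coverage variance while the pileup table reports a standard deviation, taking the square root of var puts both on the same scale. A minimal sketch, assuming a tab-separated file with a header row using the column names above; `jgi_stddev` is a hypothetical helper, and the two programs compute coverage differently, so the values need not match exactly.

```python
import csv
import io
import math


def jgi_stddev(handle):
    """Map each contig name to sqrt(var), i.e. its coverage std. deviation."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["contig"]: math.sqrt(float(row["var"])) for row in reader}


# Made-up rows, only to show the call.
example = io.StringIO(
    "contig\tlen\tavgcov\tvar\n"
    "contig_a\t1500\t10.0\t4.0\n"
    "contig_b\t800\t2.0\t0.0\n"
)
print(jgi_stddev(example))
```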
Contig classification
SAMPLE_contigs_cat.tsv contains the CAT output for every contig:
- contig
- Full name of the contig
- classified
- Whether this contig was classified (unclassified/classified)
- reason
- Reason (ORF hit count) for this classification
- lineage
- Tax id lineage (; separated)
- lineage scores
- Fraction of the ORFs that agree on the different levels (; separated)
- Tax columns (superkingdom)
- The tax names ("*" from CAT removed)
- Tax scores (superkingdom_score)
- Fraction of the ORFs that agree on that classification
- Monocladic tax columns (superkingdom_monocladic)
- This classification has no siblings in the database (CAT "*" mark)
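The lineage and lineage scores columns are parallel, ; separated lists, so they can be paired up level by level. This is a small sketch assuming both strings hold the same number of entries for a classified contig; `pair_lineage` is an illustrative name, and the tax ids in the example are placeholders rather than real output.

```python
def pair_lineage(lineage, scores, sep=";"):
    """Pair every tax id in the lineage with its ORF support fraction."""
    tax_ids = [t.strip() for t in lineage.split(sep) if t.strip()]
    values = [float(s) for s in scores.split(sep) if s.strip()]
    if len(tax_ids) != len(values):
        # Assumption violated: refuse to guess which levels belong together.
        raise ValueError("lineage and lineage scores differ in length")
    return list(zip(tax_ids, values))


# Placeholder values, only to show the call.
for tax_id, score in pair_lineage("1;131567;2", "1.00;1.00;0.85"):
    print(tax_id, score)
```

Raising an error on a length mismatch is a deliberate choice here: silently truncating with `zip` would attach scores to the wrong levels.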
SAMPLE_contigs_sourmash_lca.tsv uses the same format as the read classification Sourmash LCA file. The same scaling factor (default: 1000) is applied to the assembly.