Assembly

Reads are assembled with MEGAHIT. Only adapter trimming (if enabled) is applied to the reads used for assembly, without quality trimming, as recommended by the authors of MEGAHIT. MEGAHIT is run on the output of BBMerge, including the unmerged reads. For read mapping, the BBMerge input reads are used. The whole assembly can be classified with Sourmash, and each contig can be classified with CAT. MetaQUAST is run on each assembly.

MetaQUAST reference download is turned off by default, because parallel downloads of the same reference for two different samples result in invalid files and crashes.

The steps for this stage are defined in the Snakefile_4Assembly file. Sourmash uses the same reference data (GenomeDB based) for contig classification and read classification. CAT classification is based on NCBI's nr database.

Small metagenome: Tbd. GB
Medium metagenome: Tbd. GB
Large metagenome: Tbd. GB

Assembly and read mapping data can be generated with assembly, and the report is produced by assemblyReport. A Sourmash description of the whole assembly is available with assemblySourmash, and single contig classification with CAT can be executed with cat. Krona reports for the contigs are available with catKrona and assemblySourmashKrona. Reference data for CAT (and BAT) can be generated with the catRefData rule. The CAT contig data is reused for running BAT on the bins later. Reference data (excluding CAT/BAT) is the input for assemblyRefData.

Outputs are stored in <outputdir>/Assembly/, <outputdir>/Data/Assembly/ and <outputdir>/Reports/.

Output files

output
├── output/Assembly
│   ├── output/Assembly/CAT
│   │   └── output/Assembly/CAT/SAMPLE
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.alignment.diamond
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE_CAT.log
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.named.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.ORF2LCA.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.faa
│   │       └── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.gff
│   ├── output/Assembly/MEGAHIT
│   │   ├── output/Assembly/MEGAHIT/SAMPLE
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/checkpoints.txt
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/done
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/final.contigs.fa
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/intermediate_contigs
│   │   │   │   ├── ...
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/log
│   │   │   └── output/Assembly/MEGAHIT/SAMPLE/options.json
│   │   └── output/Assembly/MEGAHIT/SAMPLE_megahit.log
│   ├── output/Assembly/MetaQUAST
│   │   ├── output/Assembly/MetaQUAST/SAMPLE
│   │   │   └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast
│   │   │       ├── ...
│   │   │       ├── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/report.pdf
│   │   │       └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/transposed_report.txt
│   │   └── output/Assembly/MetaQUAST/SAMPLE_metaquast.log
│   ├── output/Assembly/Readmapping
│   │   └── output/Assembly/Readmapping/SAMPLE
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.tsv
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_cov.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_merged.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R1.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R2.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam.bai
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.log
│   │       └── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.sam
│   └── output/Assembly/sourmash
│       └── output/Assembly/sourmash/SAMPLE
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.log
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.sig
│           └── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_lca.log
├── output/Benchmarking
│   ├── output/Benchmarking/Assembly
│   │   ├── output/Benchmarking/Assembly/assembly_report.tsv
│   │   ├── output/Benchmarking/Assembly/cat_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/MEGAHIT_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/metaquast_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/pileup_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/read_mapping_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/sourmash_contigs_k51_SAMPLE.tsv
│   │   └── output/Benchmarking/Assembly/sourmash_contigs_lca_SAMPLE.tsv
├── output/Data
│   └── output/Data/Assembly
│       ├── output/Data/Assembly/SAMPLE_contigs_cat.tsv
│       ├── output/Data/Assembly/SAMPLE_contigs_sourmash_lca.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_jgi_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_summary.tsv
│       └── output/Data/Assembly/SAMPLE_metaquast.tsv
└── output/Reports
    ├── output/Reports/7assembly.html
    ├── output/Reports/CAT_krona.html
    ├── output/Reports/MetaQUAST_SAMPLE.html
    └── output/Reports/sourmash_contig_krona.html

See the general file information for the Benchmarking files. MEGAHIT adds “fields” to every contig description:

flag: Connectivity, 1: Standalone, 2: Looped Path, 0: Other
multi: Very rough coverage
len: Length
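The fields above can be read back from a contig header with a few lines of Python. This is a minimal sketch, assuming the description follows the layout `>NAME flag=... multi=... len=...`; the function name `parse_megahit_header` and the example header are made up for illustration.

```python
import re

# Matches "key=value" pairs in the description part of a FASTA header.
FIELD_RE = re.compile(r"(\w+)=(\S+)")

# Meaning of the flag values as listed in this section.
FLAG_MEANING = {0: "other", 1: "standalone", 2: "looped path"}


def parse_megahit_header(header):
    """Split a header such as '>k141_0 flag=1 multi=3.0000 len=301' (assumed layout)."""
    name, _, description = header.lstrip(">").strip().partition(" ")
    fields = dict(FIELD_RE.findall(description))
    flag = int(fields["flag"])
    return {
        "contig": name,
        "flag": flag,
        "connectivity": FLAG_MEANING.get(flag, "unknown"),
        "multi": float(fields["multi"]),  # very rough coverage
        "len": int(fields["len"]),
    }


# Hypothetical header, only used to demonstrate the call.
example = parse_megahit_header(">k141_0 flag=1 multi=3.0000 len=301")
print(example)
```

Headers that lack one of the three fields raise a `KeyError` in this sketch, which makes unexpected input visible instead of silently skipping it.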

Contig coverage/depth

Two different coverage programs are applied to the mapped reads. SAMPLE_cov_jgi_details.tsv is produced with the MetaBAT2 tool jgi_summarize_bam_contig_depths and SAMPLE_cov_pileup_details.tsv with pileup.sh from BBMap.

SAMPLE_cov_pileup_summary.tsv has a single row for the sample with the following columns:

reads: Number of read pairs in the sample
mappedreads: Number of read pairs that mapped to the assembly
mappedbp: Number of base pairs that mapped
contigs: Number of contigs in the assembly
contigbp: Number of base pairs in the assembly
properpairsperc: Percentage of mapped read pairs that were mapped together
avgcov: Average coverage
stddev: Standard deviation of the coverage
contigswithanycovperc: Percentage of contigs which had any coverage
bpswithanycovperc: Percentage of base pairs which had any coverage
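As an illustration of how the summary columns relate, the share of read pairs that mapped to the assembly is mappedreads divided by reads. The sketch below assumes the file is tab-separated with a header row that uses the column names listed above; the helper name `mapping_rate` and the numbers are invented for the example.

```python
import csv
import io


def mapping_rate(handle):
    """Return mappedreads / reads for the single data row of the summary file."""
    row = next(csv.DictReader(handle, delimiter="\t"))
    return float(row["mappedreads"]) / float(row["reads"])


# Made-up values standing in for SAMPLE_cov_pileup_summary.tsv.
demo = io.StringIO("reads\tmappedreads\n1000\t850\n")
rate = mapping_rate(demo)
print(f"{rate:.1%} of read pairs mapped")
```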

In contrast, SAMPLE_cov_pileup_details.tsv contains one row for every contig:

contig: Full name of the contig
avgcov: Average coverage
len: Contig length
refgc: Contig GC content
covperc: Percentage of bases covered
plus_reads: Number of reads that mapped on the + sequence
minus_reads: Number of reads that mapped on the reverse complemented sequence
readgc: Average read GC content
mediancov: Median coverage
stddev: Standard deviation of the coverage
covbases: Bases with coverage
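The per-contig table lends itself to simple filtering, for example to keep only contigs that are long enough and reasonably covered. This is a sketch under the assumption that the file is tab-separated and has a header row with the column names above; the function name, the thresholds, and the demo rows are hypothetical.

```python
import csv
import io


def well_covered_contigs(handle, min_avgcov=5.0, min_len=1000):
    """Return the names of contigs meeting both the coverage and length cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["contig"]
        for row in reader
        if float(row["avgcov"]) >= min_avgcov and int(row["len"]) >= min_len
    ]


# Invented rows standing in for SAMPLE_cov_pileup_details.tsv (subset of columns).
demo_details = io.StringIO(
    "contig\tavgcov\tlen\n"
    "k141_0\t12.5\t2500\n"   # passes both cutoffs
    "k141_1\t1.2\t4000\n"    # coverage too low
    "k141_2\t30.0\t400\n"    # too short
)
selected = well_covered_contigs(demo_details)
print(selected)
```

For a real file, pass an open handle instead of the `io.StringIO` object, for example `open(path, newline="")`.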

SAMPLE_cov_jgi_details.tsv contains only four columns:

contig: Full name of the contig
len: Contig length
avgcov: Average coverage
var: Coverage variance

Contig classification

SAMPLE_contigs_cat.tsv contains the CAT output for every contig, with the following columns:

contig: Full name of the contig
classified: Was this contig classified (unclassified/classified)
reason: Reason (ORF hit count) for this classification
lineage: ; separated tax id lineage
lineage scores: Fraction of the ORFs that agree on the different levels (; separated)
Tax columns (superkingdom): The tax names ("*" from CAT removed)
Tax scores (superkingdom_score): Fraction of the ORFs that agree on that classification
Monocladic tax columns (superkingdom_monocladic): This classification has no siblings in the database (CAT "*" mark)
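Because lineage and lineage scores are both ; separated, they can be paired position by position. The following sketch assumes both values have the same number of entries; the helper name `pair_lineage` and the example values are illustrative, and stripping a trailing "*" from tax ids is a precaution rather than documented behaviour of this file.

```python
def pair_lineage(lineage, scores):
    """Pair each tax id in 'lineage' with its entry in 'lineage scores'."""
    # Assumption: a tax id might still carry CAT's "*" mark, so strip it defensively.
    tax_ids = [t.strip().rstrip("*") for t in lineage.split(";")]
    values = [float(s) for s in scores.split(";")]
    if len(tax_ids) != len(values):
        raise ValueError("lineage and lineage scores differ in length")
    return list(zip(tax_ids, values))


# Example values only, not taken from a real run.
pairs = pair_lineage("1;131567;2", "1.00;1.00;0.85")
for tax_id, score in pairs:
    print(tax_id, score)
```

Raising an error on a length mismatch avoids silently attaching scores to the wrong taxonomic level.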

SAMPLE_contigs_sourmash_lca.tsv uses the same format as the read classification Sourmash LCA file. The same scaling factor (1000 default) is applied to the assembly.