
Assembly

Reads are assembled with MEGAHIT. Only adapter trimming (if enabled) is applied to the reads used for assembly, not quality trimming, as recommended by the authors of MEGAHIT. MEGAHIT is run on the output of BBMerge, including the unmerged reads. For read mapping, the BBMerge input reads are used. The whole assembly can be classified with Sourmash, and each contig can be classified with CAT. MetaQUAST is run on each assembly.

MetaQUAST reference download is turned off by default, because parallel downloads of the same reference for two different samples result in invalid files and crashes.

The steps for this stage are defined in the Snakefile_4Assembly file. Sourmash uses the same reference data (GenomeDB based) for contig classification and read classification. CAT classification is based on NCBI's nr database.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

Assembly and read mapping data can be generated with assembly, and the report is produced by assemblyReport. A Sourmash description of the whole assembly is available with assemblySourmash, and single contig classification with CAT can be executed with cat. Krona reports for the contigs are available with catKrona and assemblySourmashKrona. Reference data for CAT (and BAT) can be generated with the catRefData rule. The CAT contig data is reused for running BAT on the bins later. Reference data (excluding CAT/BAT) is the input for assemblyRefData.

Outputs are stored in <outputdir>/Assembly/, <outputdir>/Data/Assembly/ and <outputdir>/Reports/.


Output files

output
├── output/Assembly
│   ├── output/Assembly/CAT
│   │   └── output/Assembly/CAT/SAMPLE
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.alignment.diamond
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE_CAT.log
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.named.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.contig2classification.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.ORF2LCA.txt
│   │       ├── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.faa
│   │       └── output/Assembly/CAT/SAMPLE/SAMPLE.predicted_proteins.gff
│   ├── output/Assembly/MEGAHIT
│   │   ├── output/Assembly/MEGAHIT/SAMPLE
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/checkpoints.txt
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/done
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/final.contigs.fa
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/intermediate_contigs
│   │   │   │   ├── ...
│   │   │   ├── output/Assembly/MEGAHIT/SAMPLE/log
│   │   │   └── output/Assembly/MEGAHIT/SAMPLE/options.json
│   │   └── output/Assembly/MEGAHIT/SAMPLE_megahit.log
│   ├── output/Assembly/MetaQUAST
│   │   ├── output/Assembly/MetaQUAST/SAMPLE
│   │   │   └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast
│   │   │       ├── ...
│   │   │       ├── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/report.pdf
│   │   │       └── output/Assembly/MetaQUAST/SAMPLE/SAMPLE_metaquast/transposed_report.txt
│   │   └── output/Assembly/MetaQUAST/SAMPLE_metaquast.log
│   ├── output/Assembly/Readmapping
│   │   └── output/Assembly/Readmapping/SAMPLE
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_jgi_summarize_depth.tsv
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_cov.log
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_merged.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R1.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_pileup_unmapped_R2.fastq.gz
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.bam.bai
│   │       ├── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.log
│   │       └── output/Assembly/Readmapping/SAMPLE/SAMPLE_readmapping.sam
│   └── output/Assembly/sourmash
│       └── output/Assembly/sourmash/SAMPLE
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.log
│           ├── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_k51.sig
│           └── output/Assembly/sourmash/SAMPLE/SAMPLE_contigs_sourmash_lca.log
├── output/Benchmarking
│   ├── output/Benchmarking/Assembly
│   │   ├── output/Benchmarking/Assembly/assembly_report.tsv
│   │   ├── output/Benchmarking/Assembly/cat_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/MEGAHIT_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/metaquast_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/pileup_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/read_mapping_SAMPLE.tsv
│   │   ├── output/Benchmarking/Assembly/sourmash_contigs_k51_SAMPLE.tsv
│   │   └── output/Benchmarking/Assembly/sourmash_contigs_lca_SAMPLE.tsv
├── output/Data
│   └── output/Data/Assembly
│       ├── output/Data/Assembly/SAMPLE_contigs_cat.tsv
│       ├── output/Data/Assembly/SAMPLE_contigs_sourmash_lca.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_jgi_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_details.tsv
│       ├── output/Data/Assembly/SAMPLE_cov_pileup_summary.tsv
│       └── output/Data/Assembly/SAMPLE_metaquast.tsv
└── output/Reports
    ├── output/Reports/7assembly.html
    ├── output/Reports/CAT_krona.html
    ├── output/Reports/MetaQUAST_SAMPLE.html
    └── output/Reports/sourmash_contig_krona.html


See the general file information for the Benchmarking files. MEGAHIT adds “fields” to every contig description:

flag
Connectivity, 1: Standalone, 2: Looped Path, 0: Other
multi
Very rough coverage
len
Length
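The fields above can be read back from a contig header with a few lines of Python. This is a minimal sketch, assuming the header has the form `>NAME flag=… multi=… len=…` with whitespace between the fields; the function name `parse_megahit_header` and the example header are made up for illustration.

```python
def parse_megahit_header(header: str) -> dict:
    """Split a MEGAHIT FASTA header into the contig name and its fields."""
    # Drop the leading ">" and split on whitespace: name first, then key=value pairs.
    parts = header.lstrip(">").split()
    fields = dict(part.split("=", 1) for part in parts[1:])
    return {
        "contig": parts[0],
        "flag": int(fields["flag"]),      # connectivity: 1, 2 or 0
        "multi": float(fields["multi"]),  # very rough coverage
        "len": int(fields["len"]),        # contig length
    }


# Hypothetical header line, not taken from a real assembly.
record = parse_megahit_header(">k141_42 flag=1 multi=3.0000 len=1500")
print(record)
```

Because the coverage files list the full contig name, the same function can probably be applied to the contig column of those files as well.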

Contig coverage/depth

Two different coverage programs are applied to the mapped reads. SAMPLE_cov_jgi_details.tsv is produced with the MetaBAT2 tool jgi_summarize_bam_contig_depths and SAMPLE_cov_pileup_details.tsv with pileup.sh from BBMap.

SAMPLE_cov_pileup_summary.tsv has a single row for the sample with the following columns:

reads
Number of read pairs in the sample
mappedreads
Number of read pairs that mapped to the assembly
mappedbp
Number of base pairs that mapped
contigs
Number of contigs in the assembly
contigbp
Number of base pairs in the assembly
properpairsperc
Percentage of mapped read pairs that were mapped together
avgcov
Average coverage
stddev
Standard deviation of the coverage
contigswithanycovperc
Percentage of contigs which had any coverage
bpswithanycovperc
Percentage of base pairs which had any coverage
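As an example of using the summary row, the fraction of read pairs that mapped to the assembly follows from the reads and mappedreads columns. The sketch below assumes the file is tab-separated with a header row that uses the column names listed above; the helper name `mapped_fraction` and the numbers are hypothetical.

```python
import csv
import io


def mapped_fraction(summary_file) -> float:
    """Return mappedreads / reads from the single data row of the summary."""
    row = next(csv.DictReader(summary_file, delimiter="\t"))
    return int(row["mappedreads"]) / int(row["reads"])


# Made-up stand-in for SAMPLE_cov_pileup_summary.tsv (only two columns shown).
example = io.StringIO("reads\tmappedreads\n1000\t850\n")
fraction = mapped_fraction(example)
print(f"{fraction:.1%} of the read pairs mapped to the assembly")
```

For a real file, pass an open file handle instead of the `io.StringIO` object.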

In contrast, SAMPLE_cov_pileup_details.tsv contains one row for every contig:

contig
Full name of the contig
avgcov
Average coverage
len
Contig length
refgc
Contig GC content
covperc
Percentage of bases covered
plus_reads
Number of reads that mapped on the + sequence
minus_reads
Number of reads that mapped on the reverse complemented sequence
readgc
Average read GC content
mediancov
Median coverage
stddev
Standard deviation of the coverage
covbases
Bases with coverage
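The per-contig columns lend themselves to simple filtering, for example to keep only long contigs that are almost fully covered. This is a sketch under the assumption that the file is tab-separated with a header row matching the column names above; the function name and both thresholds are arbitrary choices for illustration.

```python
import csv
import io


def well_covered_contigs(details_file, min_len=1000, min_covperc=90.0):
    """Return the names of contigs that pass both the length and coverage cutoffs."""
    reader = csv.DictReader(details_file, delimiter="\t")
    return [
        row["contig"]
        for row in reader
        if int(row["len"]) >= min_len and float(row["covperc"]) >= min_covperc
    ]


# Made-up stand-in for SAMPLE_cov_pileup_details.tsv (only three columns shown).
example = io.StringIO(
    "contig\tlen\tcovperc\n"
    "k141_1 flag=1 multi=2.0000 len=1200\t1200\t98.5\n"
    "k141_2 flag=0 multi=1.0000 len=300\t300\t100.0\n"
    "k141_3 flag=1 multi=5.0000 len=5000\t5000\t42.0\n"
)
selected = well_covered_contigs(example)
print(selected)
```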

SAMPLE_cov_jgi_details.tsv contains only 4 columns:

contig
Full name of the contig
len
Contig length
avgcov
Average coverage
var
Coverage variance
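The pileup file reports a standard deviation while this file reports a variance, so taking the square root of var puts the two on the same scale. The sketch below assumes a tab-separated file with a header row using the column names above; the helper name is hypothetical, and the two tools may still compute coverage differently, so the values need not match exactly.

```python
import csv
import io
import math


def coverage_stddev_by_contig(jgi_file) -> dict:
    """Map each contig name to the square root of its coverage variance."""
    reader = csv.DictReader(jgi_file, delimiter="\t")
    return {row["contig"]: math.sqrt(float(row["var"])) for row in reader}


# Made-up stand-in for SAMPLE_cov_jgi_details.tsv.
example = io.StringIO(
    "contig\tlen\tavgcov\tvar\n"
    "k141_1\t1200\t10.0\t4.0\n"
    "k141_2\t300\t2.5\t0.0\n"
)
stddevs = coverage_stddev_by_contig(example)
print(stddevs)
```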

Contig classification

SAMPLE_contigs_cat.tsv contains the CAT output for every contig:

contig
Full name of the contig
classified
Was this contig classified (unclassified/classified)
reason
Reason (ORF hit count) for this classification
lineage
Tax id lineage, separated by ;
lineage scores
Fraction of the ORFs that agree on the different levels (separated by ;)
Tax columns (superkingdom)
The tax names ("*" from CAT removed)
Tax scores (superkingdom_score)
Fraction of the ORFs that agree on that classification
Monocladic tax columns (superkingdom_monocladic)
This classification has no siblings in the database (CAT "*" mark)
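The lineage and lineage scores columns can be paired up level by level. This is a minimal sketch, assuming both columns hold the same number of ;-separated entries; stripping a trailing "*" from the tax ids is a precaution in case the CAT mark is still present there, and the function name and example values are made up.

```python
def lineage_with_scores(lineage: str, scores: str) -> list:
    """Pair every tax id in the lineage with its score, in lineage order."""
    tax_ids = [tax_id.rstrip("*") for tax_id in lineage.split(";")]
    values = [float(score) for score in scores.split(";")]
    if len(tax_ids) != len(values):
        raise ValueError("lineage and lineage scores differ in length")
    return list(zip(tax_ids, values))


# Hypothetical column values for one contig.
pairs = lineage_with_scores("1;131567;2", "1.00;0.95;0.80")
print(pairs)
```

Raising an error on a length mismatch is a design choice here: silently truncating with `zip` would hide rows that do not follow the assumed format.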

SAMPLE_contigs_sourmash_lca.tsv uses the same format as the read classification Sourmash LCA file. The same scaling factor (1000 default) is applied to the assembly.