Read Preparation

Before analysing the reads, certain technical artifacts have to be removed. Depending on the input data source, sequencing adapters may need to be trimmed, overlapping reads merged and for some steps quality trimming necessary.

The steps for this stage are defined in the Snakefile_1PrepareReads file. No reference data is required.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

All preparation steps can be run with the prepareReads and prepareReadsReport rules.

Outputs are stored in <outputdir>/ReadPrep/, <outputdir>/Data/ReadPrep/ and <outputdir>/Reports/.

Used tools are cutadapt, BBmerge & BBMask, khmer and seqtk. R libraries for data processing include data.table, seqinr, plotly, ggplot, pbapply, flexdashboard and shortreads. Check the environment (envs/) files for more details.

Table of contents

  1. Raw sequencing data
  2. Adapter trimming
  3. Read merging
  4. Read quality trimming

Raw sequencing data

Sequence data files for all samples can be produced with the rule rawSeqData and FASTQC reports with rawFASTQC. A report on the user provided files can be created with rawReport.

Output files

├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/raw_report.tsv
│       ├── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/RawSequenceData
│           ├── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_overview.tsv
│           └── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_plotdata.tsv
└── output/Reports
    ├── output/Reports/1raw-reads.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.html
        ├── output/Reports/FastQC/
        ├── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.html
        └── output/Reports/FastQC/

See the general file information for the Benchmarking and seqdat files.

Adapter trimming

Even if you disable adapter trimming in the configuration file, it is still run to generate data, that should show trimming wasn’t necessary. Trimming is performed with the trim rule. Sequence data files for the trimmed files are produced with trimSeqData and the FASTQC reports with trimFASTQC.

The trimming report (trimReport) uses data from the merge operation, therefore it is included in the next section on merging.

Output files

├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/trim_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/AdapterTrimmed
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_impact.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
├── output/ReadPrep
│   └── output/ReadPrep/AdapterTrimmed
│       └── output/ReadPrep/AdapterTrimmed/SAMPLE
│           ├── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R1_adptr_tr.fastq.gz
│           └── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R2_adptr_tr.fastq.gz
└── output/Reports
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/trim_SAMPLE_R1_fastqc.html
        └── output/Reports/FastQC/trim_SAMPLE_R2_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_adptr_tr.tsv file is the minimal report produced by cutadapt.

Incomplete Adapter warning, almost always OK
Number of reads in input
Base pairs (R1+R2) in input
Reads discarded for being too short
Reads discarded for being too long
Reads discarded for containing too many unspecified base calls
Reads written
R1 reads adapter trimmed
R1 base pairs quality trimmed
R1 base pairs written
R2 reads adapter trimmed
R2 base pairs quality trimmed
R2 base pairs written

Read merging

As the R1 and R2 reads from paired end sequencing may overlap, read merging is performed with BBmerge. Merging for all samples can be performed with merge. Sequence data is generated with mergeSeqData and FASTQC is run with mergeFASTQC. The trim report (trimReport) and merge report (mergeReport) both always use untrimmed and trimmed merging results for analysis.

Output files

├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/merge_report.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merging_tr_SAMPLE.tsv
│       └── output/Benchmarking/ReadPrep/merging_untr_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/Merging_tr
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_overview.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_untr
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_overview.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_overview.tsv
│           └── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_plotdata.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/Merging_tr
│   │   └── output/ReadPrep/Merging_tr/SAMPLE
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_adapters.fa
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged.fastq.gz
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merge.log
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1.fastq.gz
│   │       └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R2.fastq.gz
│   └── output/ReadPrep/Merging_untr
│       └── output/ReadPrep/Merging_untr/SAMPLE
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_adapters.fa
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merged.fastq.gz
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merge.log
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R1.fastq.gz
│           └── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R2.fastq.gz
└── output/Reports
    ├── output/Reports/2trimming.html
    ├── output/Reports/3read-merging.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/merge_tr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R1_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R2_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R1_fastqc.html
        └── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R2_fastqc.html

See the general file information for the Benchmarking and seqdat files.

The SAMPLE_merge_{untr/tr}_insert_sizes.tsv files are directly produced by BBmerge. One line represents one read pair.

Sequence identifier as seen in the FASTQ file
Positional index of the sequence
Length of the new sequence resulting from the merge, -1 for failed merges
Character showing the merging success, can be F(ailed), I(mperfect) or P(erfect)
The number of mismatches during merging. 0 for perfect merges and missing for failed merges

For the SAMPLE_merge_{untr/tr}_overview.tsv file the BBmerge log output is parsed. The adapter output can be used to make sure that adapter trimming was successful. This is done by making sure the adaptercount field has a low value (compared to the joined/merged count).

Read pairs read from the FASTQ files
Number of pairs merged
Not merged pairs, because multiple different solutions were likely
Not mergeable pairs
Rejected pairs, because the resulting sequence would be too short
Shortest sequence length produced by merging
Median sequence length produced by merging
Longest sequence length produced by merging
Number of merged reads where a adapter read-through is expected
Conensus sequence of the suspected adapter sequence for R1
Conensus sequence of the suspected adapter sequence for R2

Read quality trimming

For some steps like read classification and direct database searches it makes sense to mask low complexity and low quality regions of the reads, as not all tools are quality aware. The merged and unmerged R1 reads are combined for these steps, because read pairings is another information some tools can’t process and R1+R2 hits on the same subject may not be correctlly accounted for. K-mer based error trimming using from khmer is performed before masking and quality scored based trimming.

If samples with different depths are compared using read analysis methods, sampling (rarefaction) of the reads can compensate for this difference. This sampling is configurable in the config file.

The quality trimming (and optional sampling) can be performed with the analysisReads, the sequence data files are produced with analysisReadsSeqData and the FASTQC reports with analysisReadsFASTQC. The report can be generated with analysisReadsReport.

├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/analysis_tr_unsmpld_seqdata_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       └── output/Benchmarking/ReadPrep/read_ana_report.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/AdapterTrimmed
│       │   └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
│       ├── output/Data/ReadPrep/AnalysisReads_tr
│       │   ├── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_tr
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_impact.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/AnalysisReads_tr
│   │   └── output/ReadPrep/AnalysisReads_tr/SAMPLE
│   │       ├── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld_concat.log
│   │       └── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld.fastq.gz
│   └── output/ReadPrep/Merging_tr
│       └── output/ReadPrep/Merging_tr/SAMPLE
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.fastq.gz
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.fastq.gz
│           └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.log
└── output/Reports
    ├── output/Reports/4read-ana-prep.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/class_tr_unsmpld_SAMPLE_fastqc.html
        ├── output/Reports/FastQC/qtr_tr_SAMPLE_merged_fastqc.html
        └── output/Reports/FastQC/qtr_tr_SAMPLE_unmgd_R1_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_qtr_tr.tsv file is also the minimal report, but the reads are processed as unpaired reads, therefore there are no R2 values. BBmask converts the low complexity regions to unspecified base calls before cutadapt is called. Also included is data from the masking and k-mer trimmming process.

Read file
False positive rate estimation for k-mer trimming
Low abundance reads skipped for k-mer trimming
Low abundance base pairs skipped for k-mer trimming
Entire reads removed due to k-mer trimming
Number of reads k-mer trimmed
Base pairs remvoved by k-mer trimming
Base pairs masked by BBmask
Incomplete Adapter warning, almost always OK
Number of reads in input
Base pairs in input
Reads discarded for being too short
Reads discarded for being too long
Reads discarded for containing too many unspecified base calls
Reads written
Reads adapter trimmed, should be 0
Base pairs quality trimmed
Base pairs written

The input basepair and read count values come from cutadapt and therefore are based on the output of the k-mer trimming operation.