Read preparation

Read Preparation

Before analysing the reads, certain technical artifacts have to be removed. Depending on the input data source, sequencing adapters may need to be trimmed, overlapping reads merged and for some steps quality trimming necessary.

The steps for this stage are defined in the Snakefile_1PrepareReads file. No reference data is required.

Small metagenome: Tbd. GB
Medium metagenome: Tbd. GB
Large metagenome: Tbd. GB

All preparation steps can be run with the prepareReads and prepareReadsReport rules.

Outputs are stored in <outputdir>/ReadPrep/, <outputdir>/Data/ReadPrep/ and <outputdir>/Reports/.

Used tools are cutadapt, BBmerge & BBMask, khmer trim-low-abund.py and seqtk. R libraries for data processing include data.table, seqinr, plotly, ggplot, pbapply, flexdashboard and shortreads. Check the environment (envs/) files for more details.

Raw sequencing data
Adapter trimming
Read merging
Read quality trimming

Raw sequencing data

Sequence data files for all samples can be produced with the rule rawSeqData and FASTQC reports with rawFASTQC. A report on the user provided files can be created with rawReport.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/raw_report.tsv
│       ├── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/RawSequenceData
│           ├── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_overview.tsv
│           └── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_plotdata.tsv
└── output/Reports
    ├── output/Reports/1raw-reads.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.html
        ├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.zip
        ├── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.html
        └── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.zip

See the general file information for the Benchmarking and seqdat files.

Adapter trimming

Even if you disable adapter trimming in the configuration file, it is still run to generate data, that should show trimming wasn’t necessary. Trimming is performed with the trim rule. Sequence data files for the trimmed files are produced with trimSeqData and the FASTQC reports with trimFASTQC.

The trimming report (trimReport) uses data from the merge operation, therefore it is included in the next section on merging.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/trim_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/AdapterTrimmed
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_impact.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
├── output/ReadPrep
│   └── output/ReadPrep/AdapterTrimmed
│       └── output/ReadPrep/AdapterTrimmed/SAMPLE
│           ├── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R1_adptr_tr.fastq.gz
│           └── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R2_adptr_tr.fastq.gz
└── output/Reports
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/trim_SAMPLE_R1_fastqc.html
        └── output/Reports/FastQC/trim_SAMPLE_R2_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_adptr_tr.tsv file is the minimal report produced by cutadapt.

status: Incomplete Adapter warning, almost always OK
in_reads: Number of reads in input
in_bp: Base pairs (R1+R2) in input
too_short: Reads discarded for being too short
too_long: Reads discarded for being too long
too_many_n: Reads discarded for containing too many unspecified base calls
out_reads: Reads written
w/adapters: R1 reads adapter trimmed
qualtrim_bp: R1 base pairs quality trimmed
out_bp: R1 base pairs written
w/adapters2: R2 reads adapter trimmed
qualtrim2_bp: R2 base pairs quality trimmed
out2_bp: R2 base pairs written

Read merging

As the R1 and R2 reads from paired end sequencing may overlap, read merging is performed with BBmerge. Merging for all samples can be performed with merge. Sequence data is generated with mergeSeqData and FASTQC is run with mergeFASTQC. The trim report (trimReport) and merge report (mergeReport) both always use untrimmed and trimmed merging results for analysis.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/merge_report.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merging_tr_SAMPLE.tsv
│       └── output/Benchmarking/ReadPrep/merging_untr_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/Merging_tr
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_overview.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_untr
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_overview.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_overview.tsv
│           └── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_plotdata.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/Merging_tr
│   │   └── output/ReadPrep/Merging_tr/SAMPLE
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_adapters.fa
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged.fastq.gz
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merge.log
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1.fastq.gz
│   │       └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R2.fastq.gz
│   └── output/ReadPrep/Merging_untr
│       └── output/ReadPrep/Merging_untr/SAMPLE
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_adapters.fa
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merged.fastq.gz
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merge.log
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R1.fastq.gz
│           └── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R2.fastq.gz
└── output/Reports
    ├── output/Reports/2trimming.html
    ├── output/Reports/3read-merging.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/merge_tr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R1_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R2_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R1_fastqc.html
        └── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R2_fastqc.html

See the general file information for the Benchmarking and seqdat files.

The SAMPLE_merge_{untr/tr}_insert_sizes.tsv files are directly produced by BBmerge. One line represents one read pair.

#id: Sequence identifier as seen in the FASTQ file
numericID: Positional index of the sequence
insert: Length of the new sequence resulting from the merge, -1 for failed merges
status: Character showing the merging success, can be F(ailed), I(mperfect) or P(erfect)
mismatches: The number of mismatches during merging. 0 for perfect merges and missing for failed merges

For the SAMPLE_merge_{untr/tr}_overview.tsv file the BBmerge log output is parsed. The adapter output can be used to make sure that adapter trimming was successful. This is done by making sure the adaptercount field has a low value (compared to the joined/merged count).

pairstomerge: Read pairs read from the FASTQ files
joined: Number of pairs merged
ambigous: Not merged pairs, because multiple different solutions were likely
nosolution: Not mergeable pairs
tooshort: Rejected pairs, because the resulting sequence would be too short
shortestinsert: Shortest sequence length produced by merging
medianinsert: Median sequence length produced by merging
longestinsert: Longest sequence length produced by merging
adaptercount: Number of merged reads where a adapter read-through is expected
Read1_adapter: Conensus sequence of the suspected adapter sequence for R1
Read2_adapter: Conensus sequence of the suspected adapter sequence for R2

Read quality trimming

For some steps like read classification and direct database searches it makes sense to mask low complexity and low quality regions of the reads, as not all tools are quality aware. The merged and unmerged R1 reads are combined for these steps, because read pairings is another information some tools can’t process and R1+R2 hits on the same subject may not be correctlly accounted for. K-mer based error trimming using trim-low-abund.py from khmer is performed before masking and quality scored based trimming.

If samples with different depths are compared using read analysis methods, sampling (rarefaction) of the reads can compensate for this difference. This sampling is configurable in the config file.

The quality trimming (and optional sampling) can be performed with the analysisReads, the sequence data files are produced with analysisReadsSeqData and the FASTQC reports with analysisReadsFASTQC. The report can be generated with analysisReadsReport.

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/analysis_tr_unsmpld_seqdata_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       └── output/Benchmarking/ReadPrep/read_ana_report.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/AdapterTrimmed
│       │   └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
│       ├── output/Data/ReadPrep/AnalysisReads_tr
│       │   ├── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_tr
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_impact.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/AnalysisReads_tr
│   │   └── output/ReadPrep/AnalysisReads_tr/SAMPLE
│   │       ├── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld_concat.log
│   │       └── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld.fastq.gz
│   └── output/ReadPrep/Merging_tr
│       └── output/ReadPrep/Merging_tr/SAMPLE
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.fastq.gz
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.fastq.gz
│           └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.log
└── output/Reports
    ├── output/Reports/4read-ana-prep.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/class_tr_unsmpld_SAMPLE_fastqc.html
        ├── output/Reports/FastQC/qtr_tr_SAMPLE_merged_fastqc.html
        └── output/Reports/FastQC/qtr_tr_SAMPLE_unmgd_R1_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_qtr_tr.tsv file is also the minimal report, but the reads are processed as unpaired reads, therefore there are no R2 values. BBmask converts the low complexity regions to unspecified base calls before cutadapt is called. Also included is data from the masking and k-mer trimmming process.

read: Read file
fpr: False positive rate estimation for k-mer trimming
reads_skipped: Low abundance reads skipped for k-mer trimming
basepairs_skipped: Low abundance base pairs skipped for k-mer trimming
reads_removed: Entire reads removed due to k-mer trimming
reads_trimmed: Number of reads k-mer trimmed
basepairs_removed_or_trimmed: Base pairs remvoved by k-mer trimming
total_masked: Base pairs masked by BBmask
status: Incomplete Adapter warning, almost always OK
in_reads: Number of reads in input
in_bp: Base pairs in input
too_short: Reads discarded for being too short
too_long: Reads discarded for being too long
too_many_n: Reads discarded for containing too many unspecified base calls
out_reads: Reads written
w/adapters: Reads adapter trimmed, should be 0
qualtrim_bp: Base pairs quality trimmed
out_bp: Base pairs written

The input basepair and read count values come from cutadapt and therefore are based on the output of the k-mer trimming operation.