Link

Read Preparation

Before analysing the reads, certain technical artifacts have to be removed. Depending on the input data source, sequencing adapters may need to be trimmed, overlapping reads merged and for some steps quality trimming necessary.

The steps for this stage are defined in the Snakefile_1PrepareReads file. No reference data is required.

Small metagenome
Tbd. GB
Medium metagenome
Tbd. GB
Large metagenome
Tbd. GB

All preparation steps can be run with the prepareReads and prepareReadsReport rules.

Outputs are stored in <outputdir>/ReadPrep/, <outputdir>/Data/ReadPrep/ and <outputdir>/Reports/.

Used tools are cutadapt, BBmerge & BBMask, khmer trim-low-abund.py and seqtk. R libraries for data processing include data.table, seqinr, plotly, ggplot, pbapply, flexdashboard and shortreads. Check the environment (envs/) files for more details.


Table of contents

  1. Raw sequencing data
  2. Adapter trimming
  3. Read merging
  4. Read quality trimming

Raw sequencing data

Sequence data files for all samples can be produced with the rule rawSeqData and FASTQC reports with rawFASTQC. A report on the user provided files can be created with rawReport.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/raw_report.tsv
│       ├── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/RawSequenceData
│           ├── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_overview.tsv
│           └── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_plotdata.tsv
└── output/Reports
    ├── output/Reports/1raw-reads.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.html
        ├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.zip
        ├── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.html
        └── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.zip

See the general file information for the Benchmarking and seqdat files.


Adapter trimming

Even if you disable adapter trimming in the configuration file, it is still run to generate data, that should show trimming wasn’t necessary. Trimming is performed with the trim rule. Sequence data files for the trimmed files are produced with trimSeqData and the FASTQC reports with trimFASTQC.

The trimming report (trimReport) uses data from the merge operation, therefore it is included in the next section on merging.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R1.tsv
│       ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R2.tsv
│       ├── output/Benchmarking/ReadPrep/trim_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R1.tsv
│       └── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R2.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       └── output/Data/ReadPrep/AdapterTrimmed
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_impact.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
├── output/ReadPrep
│   └── output/ReadPrep/AdapterTrimmed
│       └── output/ReadPrep/AdapterTrimmed/SAMPLE
│           ├── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R1_adptr_tr.fastq.gz
│           └── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R2_adptr_tr.fastq.gz
└── output/Reports
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/trim_SAMPLE_R1_fastqc.html
        └── output/Reports/FastQC/trim_SAMPLE_R2_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_adptr_tr.tsv file is the minimal report produced by cutadapt.

status
Incomplete Adapter warning, almost always OK
in_reads
Number of reads in input
in_bp
Base pairs (R1+R2) in input
too_short
Reads discarded for being too short
too_long
Reads discarded for being too long
too_many_n
Reads discarded for containing too many unspecified base calls
out_reads
Reads written
w/adapters
R1 reads adapter trimmed
qualtrim_bp
R1 base pairs quality trimmed
out_bp
R1 base pairs written
w/adapters2
R2 reads adapter trimmed
qualtrim2_bp
R2 base pairs quality trimmed
out2_bp
R2 base pairs written

Read merging

As the R1 and R2 reads from paired end sequencing may overlap, read merging is performed with BBmerge. Merging for all samples can be performed with merge. Sequence data is generated with mergeSeqData and FASTQC is run with mergeFASTQC. The trim report (trimReport) and merge report (mergeReport) both always use untrimmed and trimmed merging results for analysis.

Output files

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/merge_report.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R2.tsv
│       ├── output/Benchmarking/ReadPrep/merging_tr_SAMPLE.tsv
│       └── output/Benchmarking/ReadPrep/merging_untr_SAMPLE.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/Merging_tr
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_overview.tsv
│       │   ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_untr
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_overview.tsv
│           ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_overview.tsv
│           └── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_plotdata.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/Merging_tr
│   │   └── output/ReadPrep/Merging_tr/SAMPLE
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_adapters.fa
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged.fastq.gz
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merge.log
│   │       ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1.fastq.gz
│   │       └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R2.fastq.gz
│   └── output/ReadPrep/Merging_untr
│       └── output/ReadPrep/Merging_untr/SAMPLE
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_adapters.fa
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merged.fastq.gz
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merge.log
│           ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R1.fastq.gz
│           └── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R2.fastq.gz
└── output/Reports
    ├── output/Reports/2trimming.html
    ├── output/Reports/3read-merging.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/merge_tr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R1_fastqc.html
        ├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R2_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_merged_fastqc.html
        ├── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R1_fastqc.html
        └── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R2_fastqc.html

See the general file information for the Benchmarking and seqdat files.

The SAMPLE_merge_{untr/tr}_insert_sizes.tsv files are directly produced by BBmerge. One line represents one read pair.

#id
Sequence identifier as seen in the FASTQ file
numericID
Positional index of the sequence
insert
Length of the new sequence resulting from the merge, -1 for failed merges
status
Character showing the merging success, can be F(ailed), I(mperfect) or P(erfect)
mismatches
The number of mismatches during merging. 0 for perfect merges and missing for failed merges

For the SAMPLE_merge_{untr/tr}_overview.tsv file the BBmerge log output is parsed. The adapter output can be used to make sure that adapter trimming was successful. This is done by making sure the adaptercount field has a low value (compared to the joined/merged count).

pairstomerge
Read pairs read from the FASTQ files
joined
Number of pairs merged
ambigous
Not merged pairs, because multiple different solutions were likely
nosolution
Not mergeable pairs
tooshort
Rejected pairs, because the resulting sequence would be too short
shortestinsert
Shortest sequence length produced by merging
medianinsert
Median sequence length produced by merging
longestinsert
Longest sequence length produced by merging
adaptercount
Number of merged reads where a adapter read-through is expected
Read1_adapter
Conensus sequence of the suspected adapter sequence for R1
Read2_adapter
Conensus sequence of the suspected adapter sequence for R2

Read quality trimming

For some steps like read classification and direct database searches it makes sense to mask low complexity and low quality regions of the reads, as not all tools are quality aware. The merged and unmerged R1 reads are combined for these steps, because read pairings is another information some tools can’t process and R1+R2 hits on the same subject may not be correctlly accounted for. K-mer based error trimming using trim-low-abund.py from khmer is performed before masking and quality scored based trimming.

If samples with different depths are compared using read analysis methods, sampling (rarefaction) of the reads can compensate for this difference. This sampling is configurable in the config file.

The quality trimming (and optional sampling) can be performed with the analysisReads, the sequence data files are produced with analysisReadsSeqData and the FASTQC reports with analysisReadsFASTQC. The report can be generated with analysisReadsReport.

output
├── output/Benchmarking
│   └── output/Benchmarking/ReadPrep
│       ├── output/Benchmarking/ReadPrep/analysis_tr_unsmpld_seqdata_SAMPLE.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_unmgd_R1.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_merged.tsv
│       ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_unmgd_R1.tsv
│       └── output/Benchmarking/ReadPrep/read_ana_report.tsv
├── output/Data
│   └── output/Data/ReadPrep
│       ├── output/Data/ReadPrep/AdapterTrimmed
│       │   └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
│       ├── output/Data/ReadPrep/AnalysisReads_tr
│       │   ├── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_overview.tsv
│       │   └── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_plotdata.tsv
│       └── output/Data/ReadPrep/Merging_tr
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_impact.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_overview.tsv
│           ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_plotdata.tsv
│           └── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr.tsv
├── output/ReadPrep
│   ├── output/ReadPrep/AnalysisReads_tr
│   │   └── output/ReadPrep/AnalysisReads_tr/SAMPLE
│   │       ├── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld_concat.log
│   │       └── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld.fastq.gz
│   └── output/ReadPrep/Merging_tr
│       └── output/ReadPrep/Merging_tr/SAMPLE
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.fastq.gz
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_mask.log
│           ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.fastq.gz
│           └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.log
└── output/Reports
    ├── output/Reports/4read-ana-prep.html
    └── output/Reports/FastQC
        ├── output/Reports/FastQC/class_tr_unsmpld_SAMPLE_fastqc.html
        ├── output/Reports/FastQC/qtr_tr_SAMPLE_merged_fastqc.html
        └── output/Reports/FastQC/qtr_tr_SAMPLE_unmgd_R1_fastqc.html

See the general file information for the Benchmarking, impact and seqdat files.

The SAMPLE_qtr_tr.tsv file is also the minimal report, but the reads are processed as unpaired reads, therefore there are no R2 values. BBmask converts the low complexity regions to unspecified base calls before cutadapt is called. Also included is data from the masking and k-mer trimmming process.

read
Read file
fpr
False positive rate estimation for k-mer trimming
reads_skipped
Low abundance reads skipped for k-mer trimming
basepairs_skipped
Low abundance base pairs skipped for k-mer trimming
reads_removed
Entire reads removed due to k-mer trimming
reads_trimmed
Number of reads k-mer trimmed
basepairs_removed_or_trimmed
Base pairs remvoved by k-mer trimming
total_masked
Base pairs masked by BBmask
status
Incomplete Adapter warning, almost always OK
in_reads
Number of reads in input
in_bp
Base pairs in input
too_short
Reads discarded for being too short
too_long
Reads discarded for being too long
too_many_n
Reads discarded for containing too many unspecified base calls
out_reads
Reads written
w/adapters
Reads adapter trimmed, should be 0
qualtrim_bp
Base pairs quality trimmed
out_bp
Base pairs written

The input basepair and read count values come from cutadapt and therefore are based on the output of the k-mer trimming operation.