Read Preparation
Before analysing the reads, certain technical artifacts have to be removed. Depending on the input data source, sequencing adapters may need to be trimmed, overlapping reads merged and for some steps quality trimming necessary.
The steps for this stage are defined in the Snakefile_1PrepareReads
file. No reference data is required.
- Small metagenome
- Tbd. GB
- Medium metagenome
- Tbd. GB
- Large metagenome
- Tbd. GB
All preparation steps can be run with the prepareReads
and prepareReadsReport
rules.
Outputs are stored in <outputdir>/ReadPrep/
, <outputdir>/Data/ReadPrep/
and <outputdir>/Reports/
.
Used tools are cutadapt, BBmerge & BBMask, khmer trim-low-abund.py and seqtk. R libraries for data processing include data.table, seqinr, plotly, ggplot, pbapply, flexdashboard and shortreads. Check the environment (envs/
) files for more details.
Table of contents
Raw sequencing data
Sequence data files for all samples can be produced with the rule rawSeqData
and FASTQC reports with rawFASTQC
. A report on the user provided files can be created with rawReport
.
Output files
output
├── output/Benchmarking
│ └── output/Benchmarking/ReadPrep
│ ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R1.tsv
│ ├── output/Benchmarking/ReadPrep/raw_fastqc_SAMPLE_R2.tsv
│ ├── output/Benchmarking/ReadPrep/raw_report.tsv
│ ├── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R1.tsv
│ └── output/Benchmarking/ReadPrep/raw_seqdata_SAMPLE_R2.tsv
├── output/Data
│ └── output/Data/ReadPrep
│ └── output/Data/ReadPrep/RawSequenceData
│ ├── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_overview.tsv
│ └── output/Data/ReadPrep/RawSequenceData/SAMPLE_raw_seqdat_plotdata.tsv
└── output/Reports
├── output/Reports/1raw-reads.html
└── output/Reports/FastQC
├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.html
├── output/Reports/FastQC/raw_SAMPLE_R1_fastqc.zip
├── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.html
└── output/Reports/FastQC/raw_SAMPLE_R2_fastqc.zip
See the general file information for the Benchmarking
and seqdat
files.
Adapter trimming
Even if you disable adapter trimming in the configuration file, it is still run to generate data, that should show trimming wasn’t necessary. Trimming is performed with the trim
rule. Sequence data files for the trimmed files are produced with trimSeqData
and the FASTQC reports with trimFASTQC
.
The trimming report (trimReport
) uses data from the merge operation, therefore it is included in the next section on merging.
Output files
output
├── output/Benchmarking
│ └── output/Benchmarking/ReadPrep
│ ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R1.tsv
│ ├── output/Benchmarking/ReadPrep/trim_impact_SAMPLE_R2.tsv
│ ├── output/Benchmarking/ReadPrep/trim_SAMPLE.tsv
│ ├── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R1.tsv
│ └── output/Benchmarking/ReadPrep/trim_seqdata_SAMPLE_R2.tsv
├── output/Data
│ └── output/Data/ReadPrep
│ └── output/Data/ReadPrep/AdapterTrimmed
│ ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_impact.tsv
│ ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_overview.tsv
│ ├── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr_seqdat_plotdata.tsv
│ └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
├── output/ReadPrep
│ └── output/ReadPrep/AdapterTrimmed
│ └── output/ReadPrep/AdapterTrimmed/SAMPLE
│ ├── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R1_adptr_tr.fastq.gz
│ └── output/ReadPrep/AdapterTrimmed/SAMPLE/SAMPLE_R2_adptr_tr.fastq.gz
└── output/Reports
└── output/Reports/FastQC
├── output/Reports/FastQC/trim_SAMPLE_R1_fastqc.html
└── output/Reports/FastQC/trim_SAMPLE_R2_fastqc.html
See the general file information for the Benchmarking
, impact
and seqdat
files.
The SAMPLE_adptr_tr.tsv
file is the minimal report produced by cutadapt.
- status
- Incomplete Adapter warning, almost always OK
- in_reads
- Number of reads in input
- in_bp
- Base pairs (R1+R2) in input
- too_short
- Reads discarded for being too short
- too_long
- Reads discarded for being too long
- too_many_n
- Reads discarded for containing too many unspecified base calls
- out_reads
- Reads written
- w/adapters
- R1 reads adapter trimmed
- qualtrim_bp
- R1 base pairs quality trimmed
- out_bp
- R1 base pairs written
- w/adapters2
- R2 reads adapter trimmed
- qualtrim2_bp
- R2 base pairs quality trimmed
- out2_bp
- R2 base pairs written
Read merging
As the R1 and R2 reads from paired end sequencing may overlap, read merging is performed with BBmerge. Merging for all samples can be performed with merge
. Sequence data is generated with mergeSeqData
and FASTQC is run with mergeFASTQC
. The trim report (trimReport
) and merge report (mergeReport
) both always use untrimmed and trimmed merging results for analysis.
Output files
output
├── output/Benchmarking
│ └── output/Benchmarking/ReadPrep
│ ├── output/Benchmarking/ReadPrep/merge_report.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_merged.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R1.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_tr_unmgd_R2.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_merged.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R1.tsv
│ ├── output/Benchmarking/ReadPrep/merge_seqdata_SAMPLE_untr_unmgd_R2.tsv
│ ├── output/Benchmarking/ReadPrep/merging_tr_SAMPLE.tsv
│ └── output/Benchmarking/ReadPrep/merging_untr_SAMPLE.tsv
├── output/Data
│ └── output/Data/ReadPrep
│ ├── output/Data/ReadPrep/Merging_tr
│ │ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│ │ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_overview.tsv
│ │ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_overview.tsv
│ │ └── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_seqdat_plotdata.tsv
│ └── output/Data/ReadPrep/Merging_untr
│ ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_insert_sizes.tsv
│ ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_overview.tsv
│ ├── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_overview.tsv
│ └── output/Data/ReadPrep/Merging_untr/SAMPLE_merge_untr_seqdat_plotdata.tsv
├── output/ReadPrep
│ ├── output/ReadPrep/Merging_tr
│ │ └── output/ReadPrep/Merging_tr/SAMPLE
│ │ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_adapters.fa
│ │ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged.fastq.gz
│ │ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merge.log
│ │ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1.fastq.gz
│ │ └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R2.fastq.gz
│ └── output/ReadPrep/Merging_untr
│ └── output/ReadPrep/Merging_untr/SAMPLE
│ ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_adapters.fa
│ ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merged.fastq.gz
│ ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_merge.log
│ ├── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R1.fastq.gz
│ └── output/ReadPrep/Merging_untr/SAMPLE/SAMPLE_merge_untr_unmgd_R2.fastq.gz
└── output/Reports
├── output/Reports/2trimming.html
├── output/Reports/3read-merging.html
└── output/Reports/FastQC
├── output/Reports/FastQC/merge_tr_SAMPLE_merged_fastqc.html
├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R1_fastqc.html
├── output/Reports/FastQC/merge_tr_SAMPLE_unmgd_R2_fastqc.html
├── output/Reports/FastQC/merge_untr_SAMPLE_merged_fastqc.html
├── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R1_fastqc.html
└── output/Reports/FastQC/merge_untr_SAMPLE_unmgd_R2_fastqc.html
See the general file information for the Benchmarking
and seqdat
files.
The SAMPLE_merge_{untr/tr}_insert_sizes.tsv
files are directly produced by BBmerge. One line represents one read pair.
- #id
- Sequence identifier as seen in the FASTQ file
- numericID
- Positional index of the sequence
- insert
- Length of the new sequence resulting from the merge, -1 for failed merges
- status
- Character showing the merging success, can be F(ailed), I(mperfect) or P(erfect)
- mismatches
- The number of mismatches during merging. 0 for perfect merges and missing for failed merges
For the SAMPLE_merge_{untr/tr}_overview.tsv
file the BBmerge log output is parsed. The adapter output can be used to make sure that adapter trimming was successful. This is done by making sure the adaptercount
field has a low value (compared to the joined/merged count).
- pairstomerge
- Read pairs read from the FASTQ files
- joined
- Number of pairs merged
- ambigous
- Not merged pairs, because multiple different solutions were likely
- nosolution
- Not mergeable pairs
- tooshort
- Rejected pairs, because the resulting sequence would be too short
- shortestinsert
- Shortest sequence length produced by merging
- medianinsert
- Median sequence length produced by merging
- longestinsert
- Longest sequence length produced by merging
- adaptercount
- Number of merged reads where a adapter read-through is expected
- Read1_adapter
- Conensus sequence of the suspected adapter sequence for R1
- Read2_adapter
- Conensus sequence of the suspected adapter sequence for R2
Read quality trimming
For some steps like read classification and direct database searches it makes sense to mask low complexity and low quality regions of the reads, as not all tools are quality aware. The merged and unmerged R1 reads are combined for these steps, because read pairings is another information some tools can’t process and R1+R2 hits on the same subject may not be correctlly accounted for. K-mer based error trimming using trim-low-abund.py
from khmer is performed before masking and quality scored based trimming.
If samples with different depths are compared using read analysis methods, sampling (rarefaction) of the reads can compensate for this difference. This sampling is configurable in the config file.
The quality trimming (and optional sampling) can be performed with the analysisReads
, the sequence data files are produced with analysisReadsSeqData
and the FASTQC reports with analysisReadsFASTQC
. The report can be generated with analysisReadsReport
.
output
├── output/Benchmarking
│ └── output/Benchmarking/ReadPrep
│ ├── output/Benchmarking/ReadPrep/analysis_tr_unsmpld_seqdata_SAMPLE.tsv
│ ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_merged.tsv
│ ├── output/Benchmarking/ReadPrep/kmertr_tr_SAMPLE_unmgd_R1.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_merged.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_SAMPLE_unmgd_R1.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_merged.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_impact_SAMPLE_unmgd_R1.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_merged.tsv
│ ├── output/Benchmarking/ReadPrep/qtr_tr_seqdata_SAMPLE_tr_unmgd_R1.tsv
│ └── output/Benchmarking/ReadPrep/read_ana_report.tsv
├── output/Data
│ └── output/Data/ReadPrep
│ ├── output/Data/ReadPrep/AdapterTrimmed
│ │ └── output/Data/ReadPrep/AdapterTrimmed/SAMPLE_adptr_tr.tsv
│ ├── output/Data/ReadPrep/AnalysisReads_tr
│ │ ├── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_overview.tsv
│ │ └── output/Data/ReadPrep/AnalysisReads_tr/SAMPLE_analysis_tr_unsmpld_seqdat_plotdata.tsv
│ └── output/Data/ReadPrep/Merging_tr
│ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_merge_tr_insert_sizes.tsv
│ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_impact.tsv
│ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_overview.tsv
│ ├── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr_seqdat_plotdata.tsv
│ └── output/Data/ReadPrep/Merging_tr/SAMPLE_qtr_tr.tsv
├── output/ReadPrep
│ ├── output/ReadPrep/AnalysisReads_tr
│ │ └── output/ReadPrep/AnalysisReads_tr/SAMPLE
│ │ ├── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld_concat.log
│ │ └── output/ReadPrep/AnalysisReads_tr/SAMPLE/SAMPLE_analysis_tr_unsmpld.fastq.gz
│ └── output/ReadPrep/Merging_tr
│ └── output/ReadPrep/Merging_tr/SAMPLE
│ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_mask.log
│ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.fastq.gz
│ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_merged_qtr_kmertr.log
│ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_mask.log
│ ├── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.fastq.gz
│ └── output/ReadPrep/Merging_tr/SAMPLE/SAMPLE_merge_tr_unmgd_R1_qtr_kmertr.log
└── output/Reports
├── output/Reports/4read-ana-prep.html
└── output/Reports/FastQC
├── output/Reports/FastQC/class_tr_unsmpld_SAMPLE_fastqc.html
├── output/Reports/FastQC/qtr_tr_SAMPLE_merged_fastqc.html
└── output/Reports/FastQC/qtr_tr_SAMPLE_unmgd_R1_fastqc.html
See the general file information for the Benchmarking
, impact
and seqdat
files.
The SAMPLE_qtr_tr.tsv
file is also the minimal report, but the reads are processed as unpaired reads, therefore there are no R2 values. BBmask converts the low complexity regions to unspecified base calls before cutadapt is called. Also included is data from the masking and k-mer trimmming process.
- read
- Read file
- fpr
- False positive rate estimation for k-mer trimming
- reads_skipped
- Low abundance reads skipped for k-mer trimming
- basepairs_skipped
- Low abundance base pairs skipped for k-mer trimming
- reads_removed
- Entire reads removed due to k-mer trimming
- reads_trimmed
- Number of reads k-mer trimmed
- basepairs_removed_or_trimmed
- Base pairs remvoved by k-mer trimming
- total_masked
- Base pairs masked by BBmask
- status
- Incomplete Adapter warning, almost always OK
- in_reads
- Number of reads in input
- in_bp
- Base pairs in input
- too_short
- Reads discarded for being too short
- too_long
- Reads discarded for being too long
- too_many_n
- Reads discarded for containing too many unspecified base calls
- out_reads
- Reads written
- w/adapters
- Reads adapter trimmed, should be 0
- qualtrim_bp
- Base pairs quality trimmed
- out_bp
- Base pairs written
The input basepair and read count values come from cutadapt and therefore are based on the output of the k-mer trimming operation.