Shared files
Certain file formats occur multiple times in one pipeline or are shared across pipelines.
Benchmark files
These files are produced with Snakemake's benchmark directive. The memory values are recorded with the psutil library.
Source: StackOverflow - heathobrien, StackOverflow - Victor S
- s
- Wall-clock run time of the job in seconds
- h:m:s
- Same as s, formatted as hours:minutes:seconds
- max_rss
- Resident set size. Memory the process actually held in RAM. Includes shared memory.
- max_vms
- Virtual memory size. All memory the process has allocated, even if it never stored anything in it or mapped it to something else. Includes shared memory.
- max_uss
- Unique set size. RAM used only by this process. Doesn't include shared memory.
- max_pss
- Proportional set size. USS plus a share of the shared memory. Each shared region is divided proportionally among the processes using it.
- io_in
- Megabytes read from disk
- io_out
- Megabytes written to disk
- mean_load
- CPU time divided by the run time (s), multiplied by 100. A value of 100 corresponds to one fully used core.
The memory values are reported in megabytes and are summed across all subprocesses. Only the peak value observed during the run is reported. Use max_pss to estimate the memory the rule is likely to need in future runs, as it includes the memory shared across subprocesses without counting it more than once.
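As an illustration of using max_pss for sizing, the following minimal sketch reads a benchmark file and derives a memory request. The file content, the function name `suggest_memory_mb` and the 20 % headroom factor are assumptions made for this example and are not part of the pipeline.

```python
import csv
import io

# Hypothetical benchmark file content (tab-separated, one row per repeat).
BENCHMARK_TSV = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\n"
    "12.5000\t0:00:12\t850.20\t1900.00\t800.10\t820.40\t15.00\t3.20\t180.00\n"
    "11.9000\t0:00:11\t910.70\t1950.00\t860.00\t880.90\t15.00\t3.20\t175.50\n"
)


def suggest_memory_mb(tsv_text, headroom=1.2):
    """Return the highest max_pss of all rows, scaled by a safety factor.

    The headroom factor is an arbitrary choice for this sketch.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    peak_pss = max(float(row["max_pss"]) for row in rows)
    return peak_pss * headroom


suggested = suggest_memory_mb(BENCHMARK_TSV)
print(f"Request about {suggested:.0f} MB")
```

For a real file, the text can be read with `open(path).read()` and passed to the same function.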
Sequence data files
The read preparation steps of KnuttReads2Bins produce multiple FASTQ files. Certain statistics of the sequences and their PHRED scores are reused in the reports. Two files can be generated for every sample and FASTQ file: an overview file and a plot data file.
Overview files
One line represents one FASTQ file.
- read
- Read type of the sequences (R1, R2, merged)
- minReadLen
- Shortest sequence length in the file
- quantile25ReadLen
- 25th percentile of all read lengths
- medianReadLen
- 50th percentile of all read lengths
- quantile75ReadLen
- 75th percentile of all read lengths
- maxReadLen
- Longest sequence length
- meanReadLen
- Average read length
- reads
- Number of reads in the file
- basepairs
- Number of basepairs in the file
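The columns above can be reproduced from a list of read lengths. This is only a sketch: the function name `overview_row` and the example lengths are made up, and the quantile interpolation method (`inclusive`) is an assumption that may differ from the one the pipeline uses.

```python
import statistics


def overview_row(read_type, lengths):
    """Summarise the read lengths of one FASTQ file as one overview row."""
    # Quartiles of the read lengths; the interpolation method is an assumption.
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "read": read_type,
        "minReadLen": min(lengths),
        "quantile25ReadLen": q25,
        "medianReadLen": q50,
        "quantile75ReadLen": q75,
        "maxReadLen": max(lengths),
        "meanReadLen": statistics.mean(lengths),
        "reads": len(lengths),       # one length per read
        "basepairs": sum(lengths),   # total number of bases
    }


# Hypothetical lengths of five R1 reads.
row = overview_row("R1", [100, 120, 150, 150, 151])
print(row)
```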
Plot data files
These files are slightly more complex, as they contain multiple datasets meant for direct x, y, (z) plots.
- read
- Read type of the sequences (R1, R2, merged)
- type
- Dataset this point belongs to
- x
- x value
- y
- y value
- z
- z value, not present for all datasets
Dataset | Description | x | y | z |
---|---|---|---|---|
gccontent_density | Read GC content distribution | Read GC content | Density estimate | NA |
avg_seq_qual | Mean read PHRED score distribution | Mean PHRED score | Density estimate | NA |
cycle_qual_counts | Occurrence table of the PHRED scores | Cycle/Position | PHRED score | Occurrence count |
readlength_density | Read length distribution | Read length | Density estimate | NA |
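Because all datasets share one file, a reader usually has to split the rows by read and type before plotting. The sketch below shows one way to do this. The tab delimiter, the `NA` marker for a missing z value and the sample rows are assumptions for illustration, and `split_datasets` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Hypothetical plot data file content.
PLOT_TSV = (
    "read\ttype\tx\ty\tz\n"
    "R1\tgccontent_density\t0.40\t1.8\tNA\n"
    "R1\tgccontent_density\t0.50\t2.6\tNA\n"
    "R1\tcycle_qual_counts\t1\t36\t950\n"
    "R1\tcycle_qual_counts\t1\t20\t50\n"
)


def split_datasets(tsv_text):
    """Group the points by (read, type); z is None where it is not present."""
    datasets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        z = None if row["z"] in ("", "NA") else float(row["z"])
        point = (float(row["x"]), float(row["y"]), z)
        datasets[(row["read"], row["type"])].append(point)
    return dict(datasets)


datasets = split_datasets(PLOT_TSV)
for key, points in datasets.items():
    print(key, len(points), "points")
```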
The density estimates are calculated with R's density function, evaluated at 512 points between the smallest and largest value (from, to). With the default parameters, the estimate would extend to values outside this range.
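To illustrate what restricting the grid to the data range means, here is a rough Python sketch of a Gaussian kernel density estimate evaluated at 512 points between the minimum and maximum. It is not a reimplementation of R's density: the rule-of-thumb bandwidth and the direct (non-binned) evaluation are simplifications chosen for this example, and the function name and GC values are made up.

```python
import math
import statistics


def kde_on_data_range(values, n_points=512):
    """Gaussian KDE evaluated on an even grid from min(values) to max(values).

    Needs at least two distinct values, otherwise the bandwidth is zero.
    """
    lo, hi = min(values), max(values)
    # Rule-of-thumb bandwidth (assumption, in the spirit of Silverman's rule).
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34) if q3 > q1 else sd
    bw = 0.9 * spread * len(values) ** -0.2
    # Grid limited to the observed range, like from=min and to=max.
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    norm = 1.0 / (len(values) * bw * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


# Hypothetical GC content values of seven reads.
gc = [0.35, 0.41, 0.44, 0.47, 0.50, 0.52, 0.58]
xs, ys = kde_on_data_range(gc)
print(len(xs), xs[0], xs[-1])
```

Because the grid stops at the smallest and largest observation, the resulting curve does not cover the tails of the kernels beyond the data.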
Sequence data comparison files
It can be important to know the impact of steps like adapter and quality trimming. KnuttReads2Bins produces comparison files with information on the resulting quality and length changes. One line represents one file. Average quality refers to the mean of all PHRED scores of a single sequence.
- read
- Read type of the sequences (R1, R2, merged)
- minReadLenChange
- Smallest sequence length change
- quantile25ReadLenChange
- 25th percentile of all read length changes
- medianReadLenChange
- 50th percentile of all read length changes
- quantile75ReadLenChange
- 75th percentile of all read length changes
- maxReadLenChange
- Biggest sequence length change
- meanReadLenChange
- Average sequence length change
- minAvgQualChange
- Smallest change of the average sequence quality
- quantile25AvgQualChange
- 25th percentile of all average quality changes
- medianAvgQualChange
- 50th percentile of all average quality changes
- quantile75AvgQualChange
- 75th percentile of all average quality changes
- maxAvgQualChange
- Biggest change of average quality
- meanAvgQualChange
- Mean change of average read qualities
- newentries
- Number of IDs not present in the first file (should be 0)
- deletedentries
- Number of IDs no longer present in the second file
- matchingentries
- Number of IDs found in both files
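The ID counts and the length changes can be thought of as set operations on the read IDs of the two files. The sketch below assumes each file has already been reduced to a mapping from read ID to sequence length, and that a change is computed as second file minus first file. Both points, as well as the name `compare_reads`, are assumptions for illustration only.

```python
import statistics


def compare_reads(first, second):
    """Compare two {read ID: sequence length} mappings (first = before the step)."""
    shared = first.keys() & second.keys()
    # Length change per read found in both files (sign convention assumed).
    changes = [second[read_id] - first[read_id] for read_id in shared]
    return {
        "minReadLenChange": min(changes) if changes else None,
        "maxReadLenChange": max(changes) if changes else None,
        "meanReadLenChange": statistics.mean(changes) if changes else None,
        "newentries": len(second.keys() - first.keys()),
        "deletedentries": len(first.keys() - second.keys()),
        "matchingentries": len(shared),
    }


# Hypothetical reads before and after trimming: r2 was discarded, r1 shortened.
before = {"r1": 150, "r2": 150, "r3": 151}
after = {"r1": 140, "r3": 151}
result = compare_reads(before, after)
print(result)
```

The average quality columns would follow the same pattern, with the per-read mean PHRED score in place of the sequence length.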