
Shared files

Certain file formats occur multiple times in one pipeline or are shared across pipelines.


Table of contents

  1. Benchmark files
  2. Sequence data files
    1. Overview files
    2. Plot data files
  3. Sequence data comparison files

Benchmark files

These files are produced with Snakemake's benchmark directive. The memory values are recorded with the psutil library.

Source: StackOverflow - heathobrien, StackOverflow - Victor S

s
Summed duration (seconds) a CPU core was used
h:m:s
Same as s
max_rss
Resident set size. Actual memory stored in RAM the process used. Includes shared memory.
max_vms
Virtual memory size. The process may have allocated memory, but never actually stored something in it or mapped it to something else. Includes shared memory.
max_uss
Unique set size. RAM actually used by the process. Doesn't include shared memory.
max_pss
Proportional set size. USS together with shared memory. The shared memory is proportionally distributed among its processes.
io_in
Megabytes read from disk
io_out
Megabytes written to disk
mean_load
s divided by the run time, multiplied by 100.

The memory properties are reported in megabytes and are summed across all subprocesses. Only the highest value observed at any point during the run is reported. Use PSS to predict the memory the rule may need in future runs, as it includes the memory shared across subprocesses.
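As a rough illustration of the PSS advice above, the following sketch reads a benchmark file and derives a memory request from max_pss. It assumes the file is tab-separated with a header row that uses the column names listed above. The helper name `suggest_mem_mb`, the 25 % headroom factor and the sample values are hypothetical and not part of the pipeline.

```python
import csv
import io
import math

def suggest_mem_mb(benchmark_text, headroom=1.25):
    """Return a memory request in MB based on the largest max_pss value."""
    rows = list(csv.DictReader(io.StringIO(benchmark_text), delimiter="\t"))
    # A benchmark file may hold several repeats; use the largest peak.
    peak_pss = max(float(row["max_pss"]) for row in rows)
    # Add headroom (an assumed safety margin) and round up to whole MB.
    return math.ceil(peak_pss * headroom)

# Hypothetical file content; the numbers are invented for illustration.
example = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\n"
    "12.5\t0:00:12\t850.0\t2100.0\t700.0\t760.0\t35.0\t12.0\t180.0\n"
)
suggested = suggest_mem_mb(example)
print(suggested)
```

The result could be copied into the resources of the rule by hand; how much headroom is sensible depends on how much the input sizes vary between runs.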

Sequence data files

The read preparation steps of KnuttReads2Bins produce multiple FASTQ files. Certain statistics of the sequences and their PHRED scores are reused in the reports. Two files can be generated for every sample and FASTQ file.

Overview files

One line represents one FASTQ file.

read
Read type of the sequences (R1, R2, merged)
minReadLen
Shortest sequence length in the file
quantile25ReadLen
25th percentile of all read lengths
medianReadLen
50th percentile of all read lengths
quantile75ReadLen
75th percentile of all read lengths
maxReadLen
Longest sequence length
meanReadLen
Average read length
reads
Number of reads in the file
basepairs
Number of basepairs in the file
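The columns above can be sketched as a small summary function over the read lengths of one FASTQ file. This is only an illustration: the function name `overview_row` is hypothetical, and the quantile interpolation (`method="inclusive"`) is an assumption, since the method used by the pipeline is not stated here.

```python
import statistics

def overview_row(read_type, lengths):
    """Summarise the read lengths of one FASTQ file into the overview columns."""
    # Needs at least two reads; the interpolation method is an assumption.
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "read": read_type,
        "minReadLen": min(lengths),
        "quantile25ReadLen": q25,
        "medianReadLen": q50,
        "quantile75ReadLen": q75,
        "maxReadLen": max(lengths),
        "meanReadLen": statistics.mean(lengths),
        "reads": len(lengths),
        "basepairs": sum(lengths),
    }

# Invented read lengths for illustration.
row = overview_row("R1", [100, 120, 150, 150, 151])
print(row)
```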

Plot data files

These files are slightly more complex, as they contain multiple datasets meant for direct x, y, (z) plots.

read
Read type of the sequences (R1, R2, merged)
type
Dataset this point belongs to
x
x value
y
y value
z
z value, not present for all datasets

| Dataset | Description | x | y | z |
|---|---|---|---|---|
| gccontent_density | Read GC content distribution | Read GC content | Density estimate | NA |
| avg_seq_qual | Mean read PHRED score distribution | Mean PHRED score | Density estimate | NA |
| cycle_qual_counts | Occurrence table of the PHRED scores | Cycle/Position | PHRED score | Occurrence count |
| readlength_density | Read length distribution | Read length | Density estimate | NA |
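
To make the x, y, z layout more concrete, here is a minimal sketch of how rows of the cycle_qual_counts dataset could be built from FASTQ quality strings. It assumes Phred+33 encoded qualities and 1-based cycle numbers; the function name and the sample strings are hypothetical.

```python
from collections import Counter

def cycle_qual_counts(read_type, quality_strings, offset=33):
    """Count how often each PHRED score occurs at each cycle (position)."""
    counts = Counter()
    for qual in quality_strings:
        for cycle, char in enumerate(qual, start=1):
            # Decode the quality character, assuming a Phred+33 offset.
            counts[(cycle, ord(char) - offset)] += 1
    # One output row per (cycle, score) pair: x=cycle, y=score, z=count.
    return [
        {"read": read_type, "type": "cycle_qual_counts",
         "x": cycle, "y": score, "z": n}
        for (cycle, score), n in sorted(counts.items())
    ]

# Two invented quality strings: "I" decodes to 40, "5" decodes to 20.
rows = cycle_qual_counts("R1", ["II5", "I5"])
for r in rows:
    print(r)
```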

The density estimates are calculated with R's density function at 512 points between the smallest and the biggest value (set with the from and to arguments). With the default parameters, the estimate would extend to values outside this range.
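
The idea of restricting the estimate to the observed range can be sketched in Python as follows. This is not the pipeline's code: it evaluates a Gaussian kernel directly on an evenly spaced grid, whereas R's density uses a binned approximation, so the numbers will differ slightly. The bandwidth follows the rule of thumb of R's default (bw.nrd0), which is assumed here, and the function name and sample values are hypothetical.

```python
import math
import statistics

def density_points(values, n=512):
    """Gaussian kernel density estimate on n points between min and max."""
    lo, hi = min(values), max(values)
    # Rule-of-thumb bandwidth; requires values that are not all identical.
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34) or sd
    bw = 0.9 * spread * len(values) ** -0.2
    norm = 1.0 / (len(values) * bw * math.sqrt(2.0 * math.pi))
    step = (hi - lo) / (n - 1)
    points = []
    for i in range(n):
        # Grid runs from the smallest to the biggest value (from, to).
        x = lo + i * step
        y = norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        points.append((x, y))
    return points

# Invented GC content values for illustration.
gc = [0.40, 0.42, 0.45, 0.50, 0.51, 0.55, 0.60]
points = density_points(gc)
print(len(points), points[0], points[-1])
```

Because the grid stops at the smallest and biggest value, the area under these points is below 1; the remaining mass lies in the tails outside the range.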

Sequence data comparison files

It can be important to know the impact of steps like adapter and quality trimming. KnuttReads2Bins produces files giving information on the resulting quality and length changes. One line represents one file. Average quality refers to the mean of all PHRED scores of a single sequence.

read
Read type of the sequences (R1, R2, merged)
minReadLenChange
Smallest sequence length change
quantile25ReadLenChange
25th percentile of all read length changes
medianReadLenChange
50th percentile of all read length changes
quantile75ReadLenChange
75th percentile of all read length changes
maxReadLenChange
Biggest sequence length change
meanReadLenChange
Average sequence length change
minAvgQualChange
Smallest change of the average sequence quality
quantile25AvgQualChange
25th percentile of all average quality changes
medianAvgQualChange
50th percentile of all average quality changes
quantile75AvgQualChange
75th percentile of all average quality changes
maxAvgQualChange
Biggest change of average quality
meanAvgQualChange
Mean change of average read qualities
newentries
Number of IDs in the second file that are not in the first file (should be 0)
deletedentries
Number of IDs from the first file that are no longer in the second file
matchingentries
Number of IDs found in both files
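
A minimal sketch of how such a comparison could be computed for two versions of a FASTQ file is shown below. It is an illustration under stated assumptions: changes are taken as second file minus first file and only for IDs present in both files, qualities are assumed to be Phred+33 encoded, and the function names and sample reads are hypothetical. Only a subset of the columns is produced; the quantile columns would be derived from the same lists.

```python
import statistics

def mean_phred(quality, offset=33):
    """Mean PHRED score of one read, assuming Phred+33 encoding."""
    return statistics.mean([ord(c) - offset for c in quality])

def compare(first, second):
    """first/second map read ID -> (sequence, quality string)."""
    shared = sorted(first.keys() & second.keys())
    # Changes are computed per read as second minus first (assumed sign).
    len_change = [len(second[i][0]) - len(first[i][0]) for i in shared]
    qual_change = [mean_phred(second[i][1]) - mean_phred(first[i][1])
                   for i in shared]
    return {
        "minReadLenChange": min(len_change),
        "maxReadLenChange": max(len_change),
        "meanReadLenChange": statistics.mean(len_change),
        "minAvgQualChange": min(qual_change),
        "maxAvgQualChange": max(qual_change),
        "meanAvgQualChange": statistics.mean(qual_change),
        "newentries": len(second.keys() - first.keys()),
        "deletedentries": len(first.keys() - second.keys()),
        "matchingentries": len(shared),
    }

# Invented reads: r1 is trimmed, r2 is unchanged, r3 is dropped.
first = {"r1": ("ACGTAC", "IIII55"), "r2": ("ACGT", "I5I5"), "r3": ("AC", "55")}
second = {"r1": ("ACGT", "IIII"), "r2": ("ACGT", "I5I5")}
result = compare(first, second)
print(result)
```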