Shared files

Certain file formats occur multiple times in one pipeline or are shared across pipelines.

Benchmark files
Sequence data files
1. Overview files
2. Plot data files
Sequence data comparison files

Benchmark files

These files are produced with Snakemake's benchmark directive. The memory values are recorded with the psutil library.

Source: StackOverflow - heathobrien, StackOverflow - Victor S

s: Wall-clock running time in seconds
h:m:s: Same as s, formatted as hours:minutes:seconds
max_rss: Maximum resident set size. Memory the process actually held in RAM. Includes shared memory.
max_vms: Maximum virtual memory size. Memory the process allocated, even if it never stored anything in it or mapped it to something else. Includes shared memory.
max_uss: Maximum unique set size. RAM used exclusively by the process. Doesn't include shared memory.
max_pss: Maximum proportional set size. USS plus a share of the shared memory, which is divided proportionally among the processes using it.
io_in: Megabytes read from disk
io_out: Megabytes written to disk
mean_load: CPU time divided by the running time (s), multiplied by 100.

The memory values are reported in megabytes and are summed across all subprocesses. Only the highest value observed at any point is reported. Use PSS to predict the memory a rule may need in future runs, as it accounts for the memory shared across subprocesses.
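As a rough illustration of the PSS advice above, the following sketch reads a benchmark table and takes the highest max_pss across all rows. It assumes the file is tab-separated with a header row; the example values, the function names and the safety factor of 1.2 are hypothetical and not part of the pipeline.

```python
import csv
import io

# Hypothetical benchmark content (tab-separated, one row per repeat).
EXAMPLE = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\n"
    "12.5\t0:00:12\t150.2\t900.1\t140.0\t145.3\t10.0\t2.5\t98.0\n"
    "11.9\t0:00:11\t152.8\t905.4\t141.7\t147.9\t10.0\t2.5\t101.3\n"
)


def peak_memory_mb(handle, column="max_pss"):
    """Return the highest value of a memory column across all benchmark rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Skip placeholders that may appear when a value could not be measured.
    values = [float(row[column]) for row in reader
              if row[column] not in ("", "-", "NA")]
    if not values:
        raise ValueError(f"no usable values in column {column!r}")
    return max(values)


def suggested_mem_mb(handle, safety_factor=1.2):
    """Peak PSS plus a margin; the factor is an arbitrary illustration."""
    return peak_memory_mb(handle) * safety_factor


peak = peak_memory_mb(io.StringIO(EXAMPLE))
print(peak)
```

For a real file, pass an open file handle instead of the io.StringIO wrapper.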

Sequence data files

The read preparation steps of KnuttReads2Bins produce multiple FASTQ files. Certain statistics of the sequences and their PHRED score values are reused in the reports. Two files can be generated for every sample and FASTQ file.

Overview files

One line represents one FASTQ file.

read: Read type of the sequences (R1, R2, merged)
minReadLen: Shortest sequence length in the file
quantile25ReadLen: 25th percentile of all read lengths
medianReadLen: 50th percentile of all read lengths
quantile75ReadLen: 75th percentile of all read lengths
maxReadLen: Longest sequence length
meanReadLen: Average read length
reads: Number of reads in the file
basepairs: Number of basepairs in the file
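The columns above can be illustrated with a small sketch that derives one overview row from a list of sequences. The function names are hypothetical, and the percentile rule (linear interpolation, the same rule R uses by default) is an assumption; the pipeline may compute its percentiles differently.

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile on an already sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    pos = (len(sorted_values) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    fraction = pos - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def overview_row(read_type, sequences):
    """Build one overview row (one FASTQ file) from its sequences."""
    lengths = sorted(len(seq) for seq in sequences)
    return {
        "read": read_type,
        "minReadLen": lengths[0],
        "quantile25ReadLen": quantile(lengths, 0.25),
        "medianReadLen": quantile(lengths, 0.5),
        "quantile75ReadLen": quantile(lengths, 0.75),
        "maxReadLen": lengths[-1],
        "meanReadLen": sum(lengths) / len(lengths),
        "reads": len(lengths),
        "basepairs": sum(lengths),
    }


row = overview_row("R1", ["ACGT", "ACGTAC", "ACGTACGT", "AC", "ACGTACGTAC"])
print(row)
```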

Plot data files

These files are a little bit more complex, as they contain multiple datasets meant for direct x,y,(z) plots.

read: Read type of the sequences (R1, R2, merged)
type: Dataset this point belongs to
x: x value
y: y value
z: z value, not present for all datasets

Dataset	Description	x	y	z
gccontent_density	Read GC content distribution	Read GC content	Density estimate	NA
avg_seq_qual	Mean read PHRED score distribution	Mean PHRED score	Density estimate	NA
cycle_qual_counts	Occurrence table of the PHRED scores	Cycle/Position	PHRED score	Occurrence count
readlength_density	Read length distribution	Read length	Density estimate	NA
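Because one plot data file mixes several datasets, a reader usually has to split the rows by read and type before plotting. The sketch below shows one way to do that. It assumes a tab-separated file with a header row and NA for missing z values; the example rows and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical plot data content (tab-separated).
EXAMPLE = (
    "read\ttype\tx\ty\tz\n"
    "R1\tgccontent_density\t0.40\t1.8\tNA\n"
    "R1\tgccontent_density\t0.45\t2.1\tNA\n"
    "R1\tcycle_qual_counts\t1\t38\t1200\n"
    "R2\tcycle_qual_counts\t1\t36\t950\n"
)


def split_datasets(handle):
    """Group points by (read, type); z is kept only where it is present."""
    datasets = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        x, y = float(row["x"]), float(row["y"])
        if row["z"] in ("", "NA"):
            point = (x, y)
        else:
            point = (x, y, float(row["z"]))
        datasets[(row["read"], row["type"])].append(point)
    return dict(datasets)


datasets = split_datasets(io.StringIO(EXAMPLE))
for key, points in datasets.items():
    print(key, points)
```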

The density estimates are calculated with R's density function, using 512 points between the smallest and largest value (the from and to arguments). With the default parameters, values outside this range would also be estimated.
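To make the effect of restricting the range concrete, here is a minimal Python sketch of a Gaussian kernel density estimate evaluated on 512 evenly spaced points between the smallest and largest value. It is not the pipeline's code: the pipeline calls R's density, which uses a binned approximation, so results will differ slightly. The bandwidth follows the rule of thumb documented for R's default (bw.nrd0); the function names and the example values are hypothetical.

```python
import math
import statistics


def nrd0_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    if spread == 0:
        # Fall back as far as needed to avoid a zero bandwidth.
        spread = sd or abs(values[0]) or 1.0
    return 0.9 * spread * n ** -0.2


def density(values, points=512):
    """Evaluate a Gaussian KDE only between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    bw = nrd0_bandwidth(values)
    step = (hi - lo) / (points - 1)
    norm = 1.0 / (len(values) * bw * math.sqrt(2.0 * math.pi))
    grid = []
    for i in range(points):
        x = lo + i * step
        y = norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        grid.append((x, y))
    return grid


# Hypothetical per-read GC content values.
estimate = density([0.35, 0.41, 0.44, 0.47, 0.52, 0.58])
print(len(estimate), estimate[0], estimate[-1])
```

Each (x, y) pair corresponds to one row of a density dataset such as gccontent_density.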

Sequence data comparison files

It can be important to know the impact of steps like adapter and quality trimming. KnuttReads2Bins produces these files to report the resulting quality and length changes. One line represents one file. Average quality refers to the mean of all PHRED scores of a single sequence.

read: Read type of the sequences (R1, R2, merged)
minReadLenChange: Smallest sequence length change
quantile25ReadLenChange: 25th percentile of all read length changes
medianReadLenChange: 50th percentile of all read length changes
quantile75ReadLenChange: 75th percentile of all read length changes
maxReadLenChange: Biggest sequence length change
meanReadLenChange: Average sequence length change
minAvgQualChange: Smallest change of the average sequence quality
quantile25AvgQualChange: 25th percentile of all average quality changes
medianAvgQualChange: 50th percentile of all average quality changes
quantile75AvgQualChange: 75th percentile of all average quality changes
maxAvgQualChange: Biggest change of average quality
meanAvgQualChange: Mean change of average read qualities
newentries: Number of IDs in the second file that are not in the first file (should be 0)
deletedentries: Number of IDs from the first file that are no longer in the second file
matchingentries: Number of IDs found in both files
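The entry counts and length changes can be sketched with plain set operations on read IDs. This is only an illustration under assumptions: the function name and the example reads are hypothetical, and the sign convention (second file minus first file, so trimming gives negative changes) is assumed, not taken from the pipeline.

```python
def compare_reads(before, after):
    """Compare two mappings of read ID -> sequence (first and second file)."""
    first, second = set(before), set(after)
    shared = first & second
    # Length change per read found in both files (assumed: after minus before).
    changes = sorted(len(after[read_id]) - len(before[read_id]) for read_id in shared)
    return {
        "newentries": len(second - first),
        "deletedentries": len(first - second),
        "matchingentries": len(shared),
        "minReadLenChange": changes[0] if changes else None,
        "maxReadLenChange": changes[-1] if changes else None,
        "meanReadLenChange": sum(changes) / len(changes) if changes else None,
    }


# Hypothetical reads before and after a trimming step.
before = {"r1": "ACGTACGT", "r2": "ACGTAC", "r3": "ACGT"}
after = {"r1": "ACGTAC", "r2": "ACGTAC"}

result = compare_reads(before, after)
print(result)
```

A non-zero newentries value would mean the second file contains reads that were never in the first one, which is why the column is expected to be 0.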