Detailed description of steps performed by racoon_clip 

Quality control 

Basic quality controls are performed several times throughout the workflow using FastQC (v0.12.1) and MultiQC (v1.31). Optionally, FastQ Screen (v0.15.3) can be run with a custom FastQ Screen config file.

Quality filtering (Optional)

Sequencing reads can be filtered for a Phred score >= 10 inside the unique molecular identifier (UMI) at positions 1-10 of each read to ensure reliable sample and duplicate assignment. The cutoff can be changed by specifying another value with the racoon_clip minBaseQuality option.
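The filter described above can be illustrated with a minimal Python sketch. It is not the racoon_clip implementation: the function name, the Phred+33 encoding, and the rule that every UMI base must reach the cutoff are assumptions made for illustration.

```python
def umi_passes_quality(quality: str,
                       umi_start: int = 0,
                       umi_end: int = 10,
                       min_base_quality: int = 10,
                       phred_offset: int = 33) -> bool:
    """Return True if every base in the UMI region meets the Phred cutoff.

    quality is the FASTQ quality string of one read. Positions 1-10 of the
    read correspond to the Python slice [0:10]. Phred+33 encoding is assumed.
    """
    umi_qualities = quality[umi_start:umi_end]
    return all(ord(ch) - phred_offset >= min_base_quality for ch in umi_qualities)


# A read with one low-quality base ("#" is Phred 2) inside the UMI is rejected,
# while low quality outside the UMI region does not matter for this filter.
keep_read = umi_passes_quality("IIII#IIIII" + "I" * 30)
```

Here min_base_quality plays the role of the minBaseQuality option; raising it makes the filter stricter.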

Demultiplexing, UMI & Adapter trimming 

Demultiplexing and 3’ adapter trimming are performed with FLEXBAR (version 3.5.0). FLEXBAR also handles UMIs and trims barcodes.

If demultiplexing is turned on, it is performed by FLEXBAR using the provided barcode_fasta with the FLEXBAR parameters --barcodes {input.barcodes} --barcode-unassigned --barcode-error-rate 0.

3’ adapters are trimmed using FLEXBAR options --adapter-trim-end RIGHT --adapter-error-rate 0.1 --adapter-min-overlap 1 --adapter-cycles <as_specified> by default, but adapter trimming can also be turned off.

At the same time, UMIs (and barcodes, if present) are trimmed from the 5’ end of the reads and stored in the read names using FLEXBAR options --umi-tags --barcode-trim-end LTAIL.
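To make the effect of moving the UMI into the read name concrete, here is a small Python sketch. It assumes a contiguous UMI of a hypothetical fixed length at the 5’ end, no barcode, and an underscore as separator; in the workflow, the actual layout is defined by the barcode FASTA passed to FLEXBAR, and this function is not part of either tool.

```python
from typing import Tuple


def move_umi_to_read_name(name: str, sequence: str, quality: str,
                          umi_length: int = 10) -> Tuple[str, str, str]:
    """Cut the UMI off the 5' end and append it to the read name.

    umi_length and the "_" separator are illustrative assumptions.
    Sequence and quality string are trimmed by the same amount so that
    they stay in register.
    """
    umi = sequence[:umi_length]
    return f"{name}_{umi}", sequence[umi_length:], quality[umi_length:]


new_name, new_sequence, new_quality = move_umi_to_read_name(
    "read1", "ACGTACGTACTTTTGGGG", "I" * 18
)
```

Keeping the UMI in the read name means it survives alignment, so the deduplication step can read it back from the name later.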

For iCLIP3, or when the parameter trim3 True is used, a fixed number of nucleotides (default 3) is trimmed off the 3’ end of the reads with FLEXBAR --zip-output GZ -y 3 --min-read-length 15. This is necessary for iCLIP3, which uses a second, 3-nt-long UMI at the 3’ end.

Reads that are shorter than 15 nt after trimming are discarded using the FLEXBAR option --min-read-length 15. The cutoff can be changed by specifying another value with the racoon_clip flexbar_minReadLength option.
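The interplay of 3’ trimming and the minimum read length can be sketched as follows. This is an illustration of the logic only, with a hypothetical function name; in the workflow both steps are carried out by FLEXBAR.

```python
from typing import Optional


def trim_3prime_and_filter(sequence: str,
                           trim3: int = 3,
                           min_read_length: int = 15) -> Optional[str]:
    """Remove trim3 nucleotides from the 3' end, then apply the length cutoff.

    Returns the trimmed sequence, or None if the read would be discarded
    because it is shorter than min_read_length after trimming.
    """
    keep = max(len(sequence) - trim3, 0)  # never slice with a negative length
    trimmed = sequence[:keep]
    return trimmed if len(trimmed) >= min_read_length else None


kept = trim_3prime_and_filter("A" * 18)       # 15 nt remain, read is kept
discarded = trim_3prime_and_filter("A" * 17)  # 14 nt remain, read is dropped
```

The default min_read_length of 15 mirrors the flexbar_minReadLength default described above.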

Genome alignment 

Reads are aligned to the specified genome with STAR (version 2.7.10). In short, the genome is indexed with STAR --runMode genomeGenerate. Then, the reads of each sample are individually aligned to the genome with STAR --runMode alignReads --sjdbOverhang 139 --outFilterMismatchNoverReadLmax 0.04 --outFilterMismatchNmax 999 --outFilterMultimapNmax 1 --alignEndsType "Extend5pOfRead1" --outReadsUnmapped "Fastx" --outSJfilterReads "Unique". The resulting bam files are indexed with SAMtools index (version 1.11). All parameters except --alignEndsType "Extend5pOfRead1" can be changed via racoon_clip options.
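As a rough guide to what the two main filtering defaults mean for a single read, the following Python sketch mimics their effect. It is a simplification for single-end reads, not STAR's implementation, and the function and argument names are hypothetical.

```python
def passes_alignment_filters(n_mismatches: int,
                             read_length: int,
                             n_loci: int,
                             max_mismatch_ratio: float = 0.04,
                             max_multimap: int = 1) -> bool:
    """Simplified view of two STAR output filters.

    max_mismatch_ratio stands in for --outFilterMismatchNoverReadLmax
    (mismatches relative to read length) and max_multimap for
    --outFilterMultimapNmax (number of loci the read maps to).
    """
    uniquely_mapped = n_loci <= max_multimap
    few_mismatches = n_mismatches / read_length <= max_mismatch_ratio
    return uniquely_mapped and few_mismatches


# With the defaults, a 50-nt read with 1 mismatch at a single locus is kept,
# whereas 3 mismatches (6 % of the read) or a second locus exclude it.
example_kept = passes_alignment_filters(n_mismatches=1, read_length=50, n_loci=1)
```

With --outFilterMismatchNmax set to 999, the absolute mismatch limit is effectively disabled, so the ratio is the limiting criterion.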

Deduplication 

Aligned reads are deduplicated with umi_tools dedup --extract-umi-method read_id --method unique (UMI-tools version 1.1.1).
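Conceptually, this step collapses reads that share both the mapping position and the exact UMI. The Python sketch below illustrates that idea under simplifying assumptions: reads are given as hypothetical (name, chromosome, strand, start) tuples, the UMI is the part of the read name after the last underscore, and the first read of each group is kept. UMI-tools itself works on bam files and uses its own rules for choosing the representative read.

```python
from typing import List, Set, Tuple

Read = Tuple[str, str, str, int]  # (read name, chromosome, strand, start)


def deduplicate_unique(reads: List[Read]) -> List[Read]:
    """Keep one read per (chromosome, strand, start, UMI) combination."""
    seen: Set[Tuple[str, str, int, str]] = set()
    kept: List[Read] = []
    for name, chrom, strand, start in reads:
        umi = name.rsplit("_", 1)[-1]  # UMI stored at the end of the read name
        key = (chrom, strand, start, umi)
        if key not in seen:
            seen.add(key)
            kept.append((name, chrom, strand, start))
    return kept


example_reads = [
    ("r1_ACGT", "chr1", "+", 100),
    ("r2_ACGT", "chr1", "+", 100),  # same position and UMI: PCR duplicate
    ("r3_ACGA", "chr1", "+", 100),  # same position, different UMI: kept
    ("r4_ACGT", "chr1", "-", 100),  # same UMI, other strand: kept
]
deduplicated = deduplicate_unique(example_reads)
```

Because the comparison is exact, two UMIs that differ by a single sequencing error are treated as distinct molecules in this sketch.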

Assignment of crosslink sites of CLIP reads 

The deduplicated bam files are converted into bed files using bedtools bamtobed (version 2.30.0). The reads are shifted by 1 nt upstream with bedtools shift -m 1 -p -1 because the UV crosslink sites should be positioned 1 nt upstream of the eCLIP read starts. The bed files are split into plus and minus strands, and the reads are then reduced to 1-nt crosslink events using awk. To allow for visualization, the bed files of 1 nt events are converted to bigWig files using bedGraphToBigWig (ucsc-bedgraphtobigwig version 377). Additionally, the bigWig files of replicates are merged by groups with bigWigMerge (ucsc-bigwigmerge version 377).
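The shift and the reduction to 1-nt events can be summarised in a few lines of Python. This sketch assumes 0-based, half-open BED coordinates and assumes that the awk step keeps the 5’-most nucleotide of the shifted read; it ignores chromosome boundaries, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def crosslink_site(read: BedInterval) -> BedInterval:
    """Shift a read 1 nt upstream and reduce it to its 5'-most nucleotide.

    Upstream means towards smaller coordinates on the plus strand and
    towards larger coordinates on the minus strand, matching the
    strand-specific shifts -p -1 and -m 1.
    """
    if read.strand == "+":
        shifted_start = read.start - 1
        return BedInterval(read.chrom, shifted_start, shifted_start + 1, "+")
    shifted_end = read.end + 1
    return BedInterval(read.chrom, shifted_end - 1, shifted_end, "-")


plus_site = crosslink_site(BedInterval("chr1", 100, 150, "+"))
minus_site = crosslink_site(BedInterval("chr1", 100, 150, "-"))
```

For the plus-strand read the crosslink event lands on the nucleotide directly before the read start, and for the minus-strand read on the nucleotide directly after the read end, which is upstream in the orientation of the transcript.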

Peak calling

Peaks are called with PureCLIP on the merged bam files from each group.