Tutorial: Quickstart

How to run racoon_clip

You can run racoon_clip with the following commands:

racoon_clip crosslinks --configfile <your_configfile.yaml> --cores <n_cores> [OPTIONS]
racoon_clip peaks --configfile <your_configfile.yaml> --cores <n_cores> [OPTIONS]

The crosslinks command performs the crosslink identification pipeline, while the peaks command performs both crosslink identification and peak calling.

What you need to specify in the config file:

The config file is a .yaml file that contains all the information about your data. The following input is required from the user:

  • infiles

  • samples

  • genome_fasta

  • gtf

  • either experiment_type or specific UMI and barcode length (umi1_len, umi2_len, encode_umi_length, total_barcode_len, barcodeLength)

  • read_length

  • in some cases a barcode fasta (for the demultiplexing functionality or for data with an iCLIP, iCLIP2 barcode included)

  • optional but recommended if you use the peaks module: restrict pureclip to train its model on a few chromosomes with morePureclipParameters. This will reduce the amount of memory needed.

Note

All paths need to be specified as absolute paths. Relative paths` (for example starting with ~) are not allowed.

A minimal config file would look like this:

# where to put results
wdir: "output/path" # no backslash in the end of the path
# input
infiles: "path/to/sample1.fastq path/to/sample2.fastq" # one un-demultiplexed file or multiple demultiplexed files
samples: "sample1 sample2"
# annotation
gtf: "path/to/annotation.gtf" # has to be unzipped at the moment
genome_fasta: "path/to/genome_assembly.fa" # has to be unzipped or bgzip
star_index: "" # optional prebuilt STAR index directory
read_length: N

# experiemnt type
experiment_type: "iCLIP"/"iCLIP2"/"iCLIP3"/"eCLIP_5ntUMI"/"eCLIP_10ntUMI"/"eCLIP_ENCODE_5ntUMI"/"eCLIP_ENCODE_10ntUMI"/"noBarcode_noUMI"/"other"

# for the demultiplexing functionality or for data with experiment_type "iCLIP", "iCLIP2", or "iCLIP3"
barcodes_fasta: "path/to/barcodes.fasta" # barcodes need to have the same names as specified in the samples parameter above

# peakcalling setting (recommended)
morePureclipParameters: "-iv 'chr1;chr2;chr3;'"

What is my experiement_type?

The experiment_type specifies the barcode and adapter setup in your data. You can choose from the following options, or use a custom setup.

  • iCLIP: two UMI parts (3nt and 2nt) interspaced by the experimental barcode (4nt)

  • iCLIP2: two UMI parts (5nt and 4nt) interspaced by the experimental barcode (6nt)

  • iCLIP3: UMI of 9nt (at the 5’ end)

  • eCLIP: UMI of 10nt or 5nt in the beginning (5’ end) of read2. This option can be used for both eCLIP and seCLIP. Specify “eCLIP_10ntUMI” or “eCLIP_10ntUMI”.

  • eCLIP from ENCODE: UMI of 10nt or 5nt in the beginning (5’ end) of read2 is already trimmed off and stored in the read name. Specify “eCLIP_ENCODE_5ntUMI” or “eCLIP_ENCODE_10ntUMI”.

  • UMI and barcode are already trimmed off: If your data does not contain the UMI and barcode information anymore choose “noBarcode_noUMI” irrespective of what experiment the data is from. This is often the case for files downloaded from SRA.

How to turn optional steps on or off

You can use the following parameters to turn steps on or off:

demultiplex: True/False
quality_filter_barcodes: True/False
adapter_trimming: True/False
trim3: True/False
deduplicate: True/False

Demultiplexing

Demultiplexing is only possible for single-end read data (e.g iCLIP and iCLIP2, not eCLIP). Both the UMI and the barcode need to be positioned at the beginning of the read.

  • demultiplex (True/False): default False; Whether demultiplexing still has to be done.

  • barcodes_fasta (path to fasta): Path to fasta file of antisense sequences of the used barcodes. Not needed if data is already demultiplexed. UMI sequences should be added as N.

This is an example of a barcode fasta for an iCLIP experiment. It is important that the barcode names (after >) are exactly the same as the specified sample names and the names of the input read files. The UMIs are added as Ns.

>min_expamle_iCLIP_s1
NNNGGTTNN
>min_expamle_iCLIP_s2
NNNGGCGNN

Quality filtering during barcode trimming

  • flexbar_minReadLength (int): default 15; The minimum length a read should have after trimming of barcodes, adapters and UMIs. Shorter reads are removed.

  • quality_filter_barcodes (True/False): default True; Whether reads should be filtered for a minimum sequencing quality in the barcode sequence.

  • minBaseQuality (int): default 10; The minimum per-base quality of the barcode region of each read. Reads below this threshold are filtered out. This only applies if quality_filter_barcodes is set to True.

Adapters

  • adapter_trimming (True/False): default True; Whether adapter trimming should be performed.

  • adapter_file (path): default /params.dir/adapters.fa; A fasta file of adapters that should be trimmed. The default file contains the Illumina Universal adapter, the Illumina Multiplexing adapter and 20 eCLIP adapters.

  • adapter_cycles (int): default 1; How many cycles of adapter trimming should be performed. We recommend using 1 for iCLIP and iCLIP2 data and 2 for eCLIP.

Trimming at the 3’ end

  • trim3 (True/False): default False; Whether nucleotides should be trimmed of the 3’ end of the reads. This is necessary for iCLIP3.

  • trim3_len (int): default 3; The number of nucleotides to be trimmed off.

Deduplication

  • deduplicate (True/False): default True; Whether to perform deduplication. It is recommended always to use deduplication unless no UMIs are present in the data.

How to customise racoon_clips behaviour

Check out how to customise racoon_clip in the Tutorial: customise racoon_clip section.