Tutorial: Quickstart
Contents
How to run racoon_clip
You can run racoon_clip with the following command:
racoon_clip run --configfile <your_configfile.yaml> --cores <n_cores> [OPTIONS]
What you need to specify in the config file
The config file is a .yaml file that contains all the information about your data. The following input is required from the user:
infiles
samples
genome_fasta
gtf
either experiment_type or specific UMI and barcode length (umi1_len, umi2_len, encode_umi_length, exp_barcode_len, barcodeLength)
read_length
in some cases a barcode fasta (for the demultiplexing functionality or for data with an iCLIP or iCLIP2 barcode included)
Note
All paths need to be specified as absolute paths. Relative paths` (for example starting with ~) are not allowed.
A minimal config file would therefore look like this:
# where to put results
wdir: "output/path" # no backslash in the end of the path
# input
infiles: "path/to/sample1.fastq path/to/sample2.fastq" # one un-demultiplexed file or multiple demultiplexed files
samples: "sample1 sample2"
# annotation
gtf: "path/to/annotation.gtf" # has to be unzipped at the moment
genome_fasta: "path/to/genome_assembly.fa" # has to be unzipped or bgzip
read_length: N
# experiemnt type
experiment_type: "iCLIP"/"iCLIP2"/"eCLIP_5ntUMI"/"eCLIP_10ntUMI"/"eCLIP_ENCODE_5ntUMI"/"eCLIP_ENCODE_10ntUMI"/"noBarcode_noUMI"/"other"
# for the demultiplexing functionality or for data with experiment_type "iCLIP" or "iCLIP2"
barcodes_fasta: "path/to/barcodes.fasta" # barcodes need to have the same names as specified in the samples parameter above
What is my experiement_type?
The experiment_type specifies the barcode and adapter setup in your data. You can choose from the following options, or use a custom setup.
iCLIP: two UMI parts (3nt and 2nt) interspaced by the experimental barcode (4nt)
iCLIP2: two UMI parts (5nt and 4nt) interspaced by the experimental barcode (6nt)
eCLIP UMI of 10nt or 5nt in the beginning (5’ end) of read2. Specify “eCLIP_10ntUMI” or “eCLIP_10ntUMI”.
eCLIP from ENCODE: UMI of 10nt or 5nt in the beginning (5’ end) of read2 is already trimmed off and stored in the read name. Specify “eCLIP_ENCODE_5ntUMI” or “eCLIP_ENCODE_10ntUMI”.
UMI and barcode are already trimmed off: If your data does not contain the UMI and barcode information anymore choose “noBarcode_noUMI” irrespective of what experiment the data is from. This is often the case for files downloaded from SRA.
Which steps will racoon_clip run by default
This depends on the experiment_type. If not specified otherwise racoon_clip will run the following:
How to turn optional steps on or off
You can use the following parameters to turn steps on or off:
demultiplex: True/False
quality_filter_barcodes: True/False
adapter_trimming: True/False
deduplicate: True/False
Demultiplexing
Demultiplexing is at the moment only possible for single-end read data. Both the UMI and the barcode need to be positioned in the beginning of the read.
demultiplex (True/False): default False; Whether demultiplexing still has to be done.
barcodes_fasta (path to fasta): Path to fasta file of antisense sequences of used barcodes. Not needed if data is already demultiplexed. UMI sequences should be added as N.
This is an example of a barcode fasta for an iCLIP experiment. It is important that the barcode names (after >) are exactly the same as the specified sample names and the names of the input read files. The UMIs are added as Ns.
>min_expamle_iCLIP_s1
NNNGGTTNN
>min_expamle_iCLIP_s2
NNNGGCGNN
Quality filtering during barcode trimming
flexbar_minReadLength (int): default 15; The minimum length a read should have after trimming of barcodes, adapters and UMIs. Shorter reads are removed.
quality_filter_barcodes (True/False): default True; Whether reads should be filtered for a minimum sequencing quality in the barcode sequence.
minBaseQuality (int): default 10; The minimum per base quality of the barcode region of each read. Reads below this threshold are filtered out. This only applies if quality_filter_barcodes is set to True.
Adapters
adapter_trimming (True/False): default True; Whether adapter trimming should be performed.
adapter_file (path): default /params.dir/adapters.fa; A fasta file of adapters that should be trimmed. The default file contains the Illumina Universal adapter, the Illumina Multiplexing adapter and 20 eCLIP adapters.
adapter_cycles (int): default 1; How many cycles of adapter trimming should be performed. We recommend using 1 for iCLIP and iCLIP2 data and 2 for eCLIP.
Deduplication
deduplicate (True/False): default True; Whether to perform deduplication. It is recommended to always use deduplication unless no UMIs are present in the data.
How to customise racoon_clips behaviour
Check out how to customise racoon_clip here.