overview

htseq-clip

htseq-clip is a toolset designed for the processing and analysis of eCLIP/iCLIP datasets. The package is designed primarily to perform the following operations:

Prepare annotation

A suite of functions to process and flatten genome annotation files.

annotation

The annotation function takes a GFF-formatted genome annotation file as input and converts the annotations from GFF to BED format. For example, this function takes the following GFF annotation

chr1 HAVANA exon 1373730 1373902 . - . ID=exon:ENST00000338338.9:4;Parent=ENST00000338338.9;gene_id=ENSG00000175756.13;transcript_id=ENST00000338338.9;gene_type=protein_coding;gene_name=AURKAIP1;transcript_type=protein_coding;transcript_name=AURKAIP1-202;exon_number=4;exon_id=ENSE00001611509.1;level=2;protein_id=ENSP00000340656.5;transcript_support_level=1;tag=basic,appris_principal_1,CCDS;ccdsid=CCDS25.1;havana_gene=OTTHUMG00000001413.3;havana_transcript=OTTHUMT00000004082.2

and converts this entry into the following BED6 format

chromosome start end name score strand
chr1 1373729 1373902 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002 0 -
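The coordinate shift in this example follows from the two formats' conventions: GFF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open. A minimal sketch of that conversion is given below; the function name gff_to_bed_interval is hypothetical and is not part of htseq-clip.

```python
def gff_to_bed_interval(gff_start: int, gff_end: int) -> tuple[int, int]:
    """Convert a 1-based, inclusive GFF interval to a 0-based, half-open BED interval.

    Hypothetical helper for illustration only, not an htseq-clip function.
    """
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError(f"invalid GFF interval: {gff_start}-{gff_end}")
    # Only the start moves: the inclusive 1-based end equals the exclusive 0-based end.
    return gff_start - 1, gff_end


# Coordinates of the exon in the GFF entry shown above.
bed_start, bed_end = gff_to_bed_interval(1373730, 1373902)
print(bed_start, bed_end)  # 1373729 1373902
```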

The attributes in the name column of this BED entry are separated by @, in the order given below

attribute attribute description
ENSG00000175756.13 gene id
AURKAIP1 gene name
protein_coding gene type
exon gene feature (exon, intron, CDS,…)
2/2 2nd exon out of a total of 2 exons of this gene
ENSG00000175756.13:exon0002 unique id, merging gene id feature and feature number
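For downstream scripting, the name column can be split back into these attributes. The sketch below assumes exactly six @-separated fields in the order listed above, and that none of the fields contains an @ itself; the names FeatureName and parse_feature_name are hypothetical and not part of htseq-clip.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    gene_id: str
    gene_name: str
    gene_type: str
    feature: str
    feature_number: int  # e.g. 2 in "2/2"
    feature_total: int   # e.g. the second 2 in "2/2"
    unique_id: str


def parse_feature_name(name: str) -> FeatureName:
    """Split a flattened-annotation name column into its attributes.

    Hypothetical helper; assumes six '@'-separated fields.
    """
    fields = name.split("@")
    if len(fields) != 6:
        raise ValueError(f"expected 6 '@'-separated fields, got {len(fields)}")
    gene_id, gene_name, gene_type, feature, position, unique_id = fields
    number, total = position.split("/")
    return FeatureName(gene_id, gene_name, gene_type, feature,
                       int(number), int(total), unique_id)


parsed = parse_feature_name(
    "ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002"
)
print(parsed.gene_name, parsed.feature, parsed.feature_number, parsed.feature_total)
```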

The score column in the BED file is repurposed to hold a flag that can be used as a measure of trustworthiness or as a filter in further analysis.

Flag can have the following different values:

Flag description trustworthiness
3 only one variant of start/end positions high
2 same start position but different end positions medium
1 different start positions but same end position medium
0 different start and end positions low

An exon can belong to multiple isoforms of a gene and can therefore have different start/end positions. htseq-clip combines the positions of each exon into one entry, taking the lowest value as the start position and the highest value as the end position. As shown in the cartoon below, the first exon belongs to 3 different isoforms, so its Flag is 0 (trustworthiness: low), as both the start and end positions vary. The second exon belongs to two different isoforms, but there is only one unique start and one unique end position, hence its Flag is 3 (trustworthiness: high).

_images/flags.png

Cartoon showing flag generation process
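The merging rule and the Flag values described above can be sketched in a few lines of Python. This is an illustration of the rule as stated in this section, not the htseq-clip implementation; the function name merge_exon_variants and the coordinates in the usage lines are made up for the example.

```python
def merge_exon_variants(variants):
    """Merge the (start, end) variants of one exon and compute its Flag.

    variants: list of (start, end) tuples, one per isoform containing the exon.
    Returns (merged_start, merged_end, flag). Hypothetical helper.
    """
    if not variants:
        raise ValueError("at least one (start, end) variant is required")
    starts = {start for start, _ in variants}
    ends = {end for _, end in variants}
    same_start = len(starts) == 1
    same_end = len(ends) == 1
    if same_start and same_end:
        flag = 3  # only one variant of start/end positions
    elif same_start:
        flag = 2  # same start position, different end positions
    elif same_end:
        flag = 1  # different start positions, same end position
    else:
        flag = 0  # different start and end positions
    # Lowest start and highest end become the merged coordinates.
    return min(starts), max(ends), flag


# Made-up coordinates: an exon seen in three isoforms, then one seen in two.
print(merge_exon_variants([(100, 200), (90, 200), (100, 220)]))  # (90, 220, 0)
print(merge_exon_variants([(300, 400), (300, 400)]))             # (300, 400, 3)
```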

The intron Flag is derived from the Flags of the two exons flanking the intron. For example, if the left exon Flag is 0 and the right exon Flag is 3, the intron Flag is 1: the intron start position can have different variants, but only one variant exists for the end position. The table below shows which combinations of exon Flags yield which intron Flag.

_images/lookup.png

Intron Flag lookup table
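The reasoning behind the intron Flag can also be written out as code. The sketch below is derived only from the explanation given in this section, on the assumption that an intron starts where its left exon ends and ends where its right exon starts. It is not copied from the lookup table and may not match it in every cell; the function name intron_flag is hypothetical.

```python
def intron_flag(left_exon_flag: int, right_exon_flag: int) -> int:
    """Derive an intron Flag from the Flags of its two flanking exons.

    Assumption: intron start = end of the left exon,
                intron end   = start of the right exon.
    Hypothetical helper, not an htseq-clip function.
    """
    for flag in (left_exon_flag, right_exon_flag):
        if flag not in (0, 1, 2, 3):
            raise ValueError(f"exon Flag must be 0, 1, 2 or 3, got {flag}")
    # The left exon's end varies when its Flag is 2 or 0.
    start_varies = left_exon_flag in (0, 2)
    # The right exon's start varies when its Flag is 1 or 0.
    end_varies = right_exon_flag in (0, 1)
    if not start_varies and not end_varies:
        return 3
    if not start_varies:
        return 2  # same start, different ends
    if not end_varies:
        return 1  # different starts, same end
    return 0


print(intron_flag(0, 3))  # 1, the case described in the text
```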

createSlidingWindows

The createSlidingWindows function takes as input a flattened annotation BED file created by the annotation function and splits each BED entry into overlapping windows. The --windowSize parameter controls the size of each window, and --windowStep controls the distance between the start positions of neighboring windows from the same feature, and thereby their overlap.

Continuing with the example entry above, the first 5 sliding windows generated from the BED6 flattened entry are given below:

chromosome start end name score strand
chr1 1373729 1373779 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002W00001@1 0 -
chr1 1373749 1373799 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002W00002@2 0 -
chr1 1373769 1373819 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002W00003@3 0 -
chr1 1373789 1373839 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002W00004@4 0 -
chr1 1373809 1373859 ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002W00005@5 0 -

Each sliding window listed here is 50 bp long, as the default value of the --windowSize argument is 50, and the start positions of neighboring windows differ by 20 bp, as the default value of the --windowStep argument is 20.
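The window coordinates listed above can be reproduced with a short sketch. How htseq-clip treats the last window of a feature is not described in this section, so the truncation at the feature end below is an assumption, and the function name sliding_windows is hypothetical.

```python
def sliding_windows(start, end, window_size=50, window_step=20):
    """Split the BED interval [start, end) into overlapping windows.

    Assumption: the final window is truncated at the feature end.
    Hypothetical helper, not the htseq-clip implementation.
    """
    if window_size <= 0 or window_step <= 0:
        raise ValueError("window_size and window_step must be positive")
    windows = []
    window_start = start
    while window_start < end:
        window_end = min(window_start + window_size, end)
        windows.append((window_start, window_end))
        if window_end == end:
            break  # the feature end has been reached
        window_start += window_step
    return windows


# First five windows of the exon chr1:1373729-1373902 from the example.
for window_start, window_end in sliding_windows(1373729, 1373902)[:5]:
    print("chr1", window_start, window_end)
```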

Following the convention of the flattened annotation, the attributes in the sliding window name column are also separated by @, and the first 5 attributes are exactly the same as those in the flattened annotation name column. An example is given below

attribute attribute description Found in flattened name attribute
ENSG00000175756.13 gene id Yes
AURKAIP1 gene name Yes
protein_coding gene type Yes
exon gene feature (exon, intron, CDS,…) Yes
2/2 2nd exon out of a total of 2 exons of this gene Yes
ENSG00000175756.13:exon0002W00001 unique id, merging gene id feature, feature number and window number (W : window) No
1 1st window of this feature No
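The relationship between a flattened name and its window names can be illustrated as follows. The five-digit zero padding after W is inferred from the example entries above and is an assumption, as are the helper names window_name and split_window_name; neither is part of htseq-clip.

```python
def window_name(feature_name: str, window_number: int) -> str:
    """Build a sliding-window name from a flattened-annotation name.

    Assumption: the window number is zero-padded to five digits after 'W'.
    """
    return f"{feature_name}W{window_number:05d}@{window_number}"


def split_window_name(name: str):
    """Return (flattened feature name, window number) for a window name."""
    flattened, _, number = name.rpartition("@")
    # Drop the trailing 'W' plus padded window number from the unique id.
    feature_name, _, _padded = flattened.rpartition("W")
    return feature_name, int(number)


feature = "ENSG00000175756.13@AURKAIP1@protein_coding@exon@2/2@ENSG00000175756.13:exon0002"
name = window_name(feature, 1)
print(name)
print(split_window_name(name) == (feature, 1))  # True
```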

Note

There is no overlap between neighboring windows from two separate gene features.

Further analysis

Further analysis and processing of crosslink windows is done using the R/Bioconductor package DEWSeq. Please refer to the user manual of that package for requirements, installation and help.