pyCRAC software
This page shows a few command lines that demonstrate how you can use the pyCRAC software package to process CLIP or CRAC data.
You can use python pip to install pyCRAC: sudo pip install pyCRAC.
Note that the latest version of pyCRAC is compattible with Python 3.
The latest test versions can also be downloaded from:
A software package without bugs does not exist.
Please check the github repository regularly for bug fixes and other updates.
Feedback is also much appreciated!!
Please site the following paper when using the package for your data analyses:
Shaun Webb, Ralph D. Hector, Grzegorz Kudla and Sander Granneman.
PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast.
Genome Biology 2014;15:R8
GTF2_parser_examples
A few examples of how to use GTF2 parser in your python programs
First, I import the GTF2 parser and the default GTF and tab file (yeast, Saccharomyces cerevisiae). I will also need numpy later on.
In [1]:
from pyCRAC.Parsers import GTF2
from pyCRAC.Methods import translate
import numpy as np
import re
Next I create a gtf object that will inherit all the GTF2 class features:
In [2]:
gtf = GTF2.Parse_GTF()
gtf.read_GTF("Saccharomyces_cerevisiae.R64-1-1.75_1.2.gtf",transcripts=False) # or your own GTF file
gtf.read_FASTA("Saccharomyces_cerevisiae.R64-1-1.75.fa") # or your own tab file
Below a few examples on how to use it. Let's say we would like to have the genomic coordinates for the gene RPL7A, which encodes a ribosomal protein:
In [3]:
gene = "RPL7A"
In [4]:
gtf.chromosomeCoordinates(gene)
Out[4]:
This returns the chromosomal start and end positions of the gene, including the UTRs. If UTRs information is not annotated in your GTF file you can use the "ranges" option to add flanking sequences:
In [5]:
gtf.chromosomeCoordinates(gene,ranges=200)
Out[5]:
Next I want to know on which chromosome and strand this gene is:
In [6]:
gtf.strand(gene),gtf.chromosome(gene)
Out[6]:
OK, so RPL7A is on the crick (-) strand on chromosome seven. Next I am testing all the functions on RPL7A:
In [7]:
gtf.utrCoordinates(gene,ranges=200)
Out[7]:
In [8]:
gtf.utrCoordinates(gene)
Out[8]:
In [9]:
gtf.cdsCoordinates(gene)
Out[9]:
In [10]:
gtf.codingSequenceCoordinates(gene)
Out[10]:
In [11]:
gtf.exonCoordinates(gene)
Out[11]:
In [12]:
gtf.intronCoordinates(gene)
Out[12]:
In [13]:
gtf.fivePrimeSpliceSites(gene)
Out[13]:
In [14]:
gtf.threePrimeSpliceSites(gene)
Out[14]:
In [15]:
gtf.fivePrimeEnd(gene)
Out[15]:
In [16]:
gtf.threePrimeEnd(gene)
Out[16]:
In [17]:
gtf.chromosomeCoordinates(gene)
Out[17]:
In [18]:
gtf.cdsStart(gene)
Out[18]:
In [19]:
gtf.cdsEnd(gene)
Out[19]:
To extract nucleotide positions one can also create numpy arrays for each position:
In [20]:
gtf.geneIterCoordinates(gene)
Out[20]:
In [21]:
gtf.geneIterCoordinates(gene,coordinates="coding")
Out[21]:
Next I want to have the splice site sequences for RPL7A for each intron:
In [22]:
intronnumber = 0
for intron in gtf.intronSequences(gene):
intronnumber += 1
print("intron %s:\n5' splice site:\t%s\n3' splice site:\t%s\n" % \
(intronnumber,intron[:5],intron[-5:]))
What are the UTR coordinates for RPL7A?
In [23]:
utrcoordinates = gtf.utrCoordinates(gene)
fiveutr = utrcoordinates[0]
threeutr = utrcoordinates[1]
print("5'UTR:\t%s\n3'UTR:\t%s" % (fiveutr,threeutr))
OK, now I would like to have the genomic sequence for RPL7A in a readable fasta format:
In [24]:
genomicseq = gtf.genomicSequence(gene,format="fasta",split=True)
print(genomicseq)
What if I just want the coding sequence?
Below the coding sequence from SGD:
In [25]:
sgdseq = "ATGGCCGCTGAAAAAATCTTGACCCCAGAATCTCAGTTGAAGAAGT\
CTAAGGCTCAACAAAAGACTGCTGAACAAGTCGCTGCTGAAA\
GAGCTGCTCGTAAGGCTGCTAACAAGGAAAAGAGAGCCATTA\
TTTTGGAAAGAAACGCCGCTTACCAAAAGGAATACGAAACTG\
CTGAAAGAAACATCATTCAAGCTAAGCGTGATGCCAAGGCTG\
CTGGTTCCTACTACGTCGAAGCTCAACACAAGTTGGTCTTCG\
TTGTCAGAATCAAGGGTATTAACAAGATCCCACCTAAGCCAA\
GAAAGGTTCTACAATTGCTAAGATTGACAAGAATCAACTCTG\
GTACATTCGTCAAAGTTACCAAGGCTACTTTGGAACTATTGA\
AGTTGATTGAACCATACGTTGCTTACGGTTACCCATCGTACT\
CTACTATTAGACAATTGGTCTACAAGAGAGGTTTCGGTAAGA\
TCAACAAGCAAAGAGTTCCATTGTCCGACAATGCTATCATCG\
AAGCCAACTTGGGTAAGTATGGTATCTTGTCCATTGACGATT\
TGATTCACGAAATCATCACTGTTGGTCCACACTTCAAGCAAG\
CTAACAACTTTTTGTGGCCATTCAAGTTGTCCAACCCATCTG\
GTGGTTGGGGTGTCCCAAGAAAGTTCAAGCACTTTATCCAAG\
GTGGTTCTTTCGGTAACCGTGAAGAATTCATCAACAAATTGG\
TTAAGTCCATGAAC"
In [26]:
gtf.codingSequence(gene)
Out[26]:
In [27]:
gtf.codingSequence(gene) == sgdseq
Out[27]:
In [28]:
codingseq = gtf.codingSequence(gene,format="fasta",split=True)
print(codingseq)
Does the sequence translate properly?
In [29]:
protein = translate(gtf.codingSequence("RPL7A"))
print(protein)
Now for some more complicated examples. What if I want to calculate overlap between interval coordinates and gtf feature intervals? I know my intervals map to chromosome I so what I need is a list of start and end positions for each gene on chromosome I and also the gene name:
In [30]:
coordinates = gtf.chromosomeGeneCoordIterator("I")
Just want to see the first ten coordinates:
In [31]:
coordinates[:10]
Out[31]:
OK but what if I know that my intervals are only on the '-' strand?
In [32]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-")
coordinates[:10]
Out[32]:
But actually I only care about protein coding genes:
In [33]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",annotation="protein_coding")
coordinates[:10]
Out[33]:
To calculate overlap the GTF2 parsee uses numpy arrays:
In [34]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",annotation="protein_coding",numpy=True)
coordinates[:10]
Out[34]:
Just to show you how it works an example below. Let's say I have an interval that maps to chrI on the '-' strand and I want to know what features overlap with this interval
In [35]:
intervalstart = 134343
intervalend = 135904
Next I need to import the numpy_overlap method from the Methods file:
In [36]:
from pyCRAC.Methods import numpy_overlap
OK let's find some genes that overlap sense with this interval:
In [37]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[37]:
This shows that my interval maps to an intergenic region (INT) and the MDM10 gene). What if I don't care on which strand the feature is located (i.e. also include anti-sense reads)?¶
In [38]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand=None,numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[38]:
Right so SPO7 is anti-sense to my gene. If I just wanted to know the anti-sense hits:
In [39]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand="+",numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[39]:
But what if I want to have at least 100 nucleotide overlap with the features?
In [40]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand=None,numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend,overlap=100)
Out[40]:
What genes are there on chromosome I (only showing the first 20)?
In [41]:
gtf.chromosomesGenesList("I")[:20]
Out[41]:
Actally I just want the protein coding genes:
In [42]:
gtf.chromosomesGenesList("I",annotation="protein_coding")[:20]
Out[42]:
What about snoRNAs (showing all present)?¶
In [43]:
gtf.chromosomesGenesList("I",annotation="snoRNA")
Out[43]:
What about tRNAs? (sbowing all present)?
In [44]:
gtf.chromosomesGenesList("I",annotation="tRNA")
Out[44]:
Well I guess you get the idea by now...
The geneIterCoordinates class method is very useful for generating nucleotide pileups because it gives you the flexibility to add flanking sequences and/or remove introns. An example:
In [45]:
gtf.chromosomeCoordinates("MDM10")
Out[45]:
In [46]:
nucleotidecoord = gtf.geneIterCoordinates("MDM10")
nucleotidecoord
Out[46]:
Where does my interval overlap with the MDM10 coordinates? Numpy has this great function called 'where'.
In [47]:
intervalstart = 134143
intervalend = 135604
np.where((intervalstart <=nucleotidecoord) & (intervalend >=nucleotidecoord))
Out[47]:
That gives you the indexes for the positions where there is overlap. But if I want to make a pileup, I have to keep track of where it overlaps:
In [48]:
pileup = np.where((intervalstart <=nucleotidecoord) & (intervalend >=nucleotidecoord),1,0)
pileup
Out[48]:
We can use this trick to make a pileup for our gene of interest to make a plot for the number of reads that map to our favorite gene
Below an example. I want to make a pileup for all the intervals that mapped to RPL7A
In [49]:
nucleotidecoord = gtf.geneIterCoordinates("RPL7A") # generate an array with chromosomal coordinates for each nucleotide in RPL7A
pileup = np.zeros(len(nucleotidecoord)) # make an array to collect the hits
start = np.random.randint(364249,366023,1000) # make an array with start coordinates.
end = start+30 # read length = 30 nt.
coordinates = zip(start,end) # merge them together
#print(list(coordinates))
for x,y in list(coordinates):
pileup += np.where((x <=nucleotidecoord) & (y >=nucleotidecoord),1,0)
print(pileup)
So now I added another overlapping interval to the pileup. To print the pileup, I simply do the following:
In [50]:
genomicsequence = gtf.genomicSequence("RPL7A")
for pos,nuc in enumerate(genomicsequence):
print("%s\t%s\t%s" % (pos+1,genomicsequence[pos],pileup[pos]))
OK but what if I know that my reads only map to spliced RPL7A?
In [51]:
nucleotidecoord = gtf.geneIterCoordinates("RPL7A",coordinates="coding") # generate an array with chromosomal coordinates for each nucleotide in RPL7A
pileup = np.zeros(len(nucleotidecoord)) # make an array to collect the hits
for start,end in coordinates:
pileup += np.where((start <=nucleotidecoord) & (end >=nucleotidecoord),1,0)