pyCRAC software

This page shows a few command lines that demonstrate how you can use the pyCRAC software package to process CLIP or CRAC data.

You can use python pip to install pyCRAC: sudo pip install pyCRAC.
Note that the latest version of pyCRAC is compattible with Python 3.
 
The latest test versions can also be downloaded from:
 
 
A software package without bugs does not exist.
Please check the github repository regularly for bug fixes and other updates.
Feedback is also much appreciated!!
 
Please site the following paper when using the package for your data analyses:
Shaun Webb, Ralph D. Hector, Grzegorz Kudla and Sander Granneman.
PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast.
Genome Biology 2014;15:R8 

GTF2_parser_examples

 

A few examples of how to use GTF2 parser in your python programs

 

First, I import the GTF2 parser and the default GTF and tab file (yeast, Saccharomyces cerevisiae). I will also need numpy later on.

In [1]:
from pyCRAC.Parsers import GTF2
from pyCRAC.Methods import translate
import numpy as np
import re
 

Next I create a gtf object that will inherit all the GTF2 class features:

In [2]:
gtf = GTF2.Parse_GTF()
gtf.read_GTF("Saccharomyces_cerevisiae.R64-1-1.75_1.2.gtf",transcripts=False)  # or your own GTF file
gtf.read_FASTA("Saccharomyces_cerevisiae.R64-1-1.75.fa")  # or your own tab file
 

Below a few examples on how to use it. Let's say we would like to have the genomic coordinates for the gene RPL7A, which encodes a ribosomal protein:

In [3]:
gene = "RPL7A"
In [4]:
gtf.chromosomeCoordinates(gene)
Out[4]:
(364167, 366110)
 

This returns the chromosomal start and end positions of the gene, including the UTRs. If UTRs information is not annotated in your GTF file you can use the "ranges" option to add flanking sequences:

In [5]:
gtf.chromosomeCoordinates(gene,ranges=200)
Out[5]:
(363967, 366310)
 

Next I want to know on which chromosome and strand this gene is:

In [6]:
gtf.strand(gene),gtf.chromosome(gene)
Out[6]:
('-', 'VII')
 

OK, so RPL7A is on the crick (-) strand on chromosome seven. Next I am testing all the functions on RPL7A:

In [7]:
gtf.utrCoordinates(gene,ranges=200)
Out[7]:
[(365997, 366197), (364137, 364337)]
In [8]:
gtf.utrCoordinates(gene)
Out[8]:
[(365997, 366110), (364167, 364337)]
In [9]:
gtf.cdsCoordinates(gene)
Out[9]:
[(364338, 364964), (365433, 365526), (365986, 365996)]
In [10]:
gtf.codingSequenceCoordinates(gene)
Out[10]:
[(364338, 364964), (365433, 365526), (365986, 365996)]
In [11]:
gtf.exonCoordinates(gene)
Out[11]:
[(364167, 364964), (365433, 365526), (365986, 366110)]
In [12]:
gtf.intronCoordinates(gene)
Out[12]:
[(364965, 365432), (365527, 365985)]
In [13]:
gtf.fivePrimeSpliceSites(gene)
Out[13]:
[365432, 365985]
In [14]:
gtf.threePrimeSpliceSites(gene)
Out[14]:
[364965, 365527]
In [15]:
gtf.fivePrimeEnd(gene)
Out[15]:
366110
In [16]:
gtf.threePrimeEnd(gene)
Out[16]:
364167
In [17]:
gtf.chromosomeCoordinates(gene)
Out[17]:
(364167, 366110)
In [18]:
gtf.cdsStart(gene)
Out[18]:
365996
In [19]:
gtf.cdsEnd(gene)
Out[19]:
364338
 

To extract nucleotide positions one can also create numpy arrays for each position:

In [20]:
gtf.geneIterCoordinates(gene)
Out[20]:
array([364167, 364168, 364169, ..., 366108, 366109, 366110])
In [21]:
gtf.geneIterCoordinates(gene,coordinates="coding")
Out[21]:
array([364338, 364339, 364340, 364341, 364342, 364343, 364344, 364345,
       364346, 364347, 364348, 364349, 364350, 364351, 364352, 364353,
       364354, 364355, 364356, 364357, 364358, 364359, 364360, 364361,
       364362, 364363, 364364, 364365, 364366, 364367, 364368, 364369,
       364370, 364371, 364372, 364373, 364374, 364375, 364376, 364377,
       364378, 364379, 364380, 364381, 364382, 364383, 364384, 364385,
       364386, 364387, 364388, 364389, 364390, 364391, 364392, 364393,
       364394, 364395, 364396, 364397, 364398, 364399, 364400, 364401,
       364402, 364403, 364404, 364405, 364406, 364407, 364408, 364409,
       364410, 364411, 364412, 364413, 364414, 364415, 364416, 364417,
       364418, 364419, 364420, 364421, 364422, 364423, 364424, 364425,
       364426, 364427, 364428, 364429, 364430, 364431, 364432, 364433,
       364434, 364435, 364436, 364437, 364438, 364439, 364440, 364441,
       364442, 364443, 364444, 364445, 364446, 364447, 364448, 364449,
       364450, 364451, 364452, 364453, 364454, 364455, 364456, 364457,
       364458, 364459, 364460, 364461, 364462, 364463, 364464, 364465,
       364466, 364467, 364468, 364469, 364470, 364471, 364472, 364473,
       364474, 364475, 364476, 364477, 364478, 364479, 364480, 364481,
       364482, 364483, 364484, 364485, 364486, 364487, 364488, 364489,
       364490, 364491, 364492, 364493, 364494, 364495, 364496, 364497,
       364498, 364499, 364500, 364501, 364502, 364503, 364504, 364505,
       364506, 364507, 364508, 364509, 364510, 364511, 364512, 364513,
       364514, 364515, 364516, 364517, 364518, 364519, 364520, 364521,
       364522, 364523, 364524, 364525, 364526, 364527, 364528, 364529,
       364530, 364531, 364532, 364533, 364534, 364535, 364536, 364537,
       364538, 364539, 364540, 364541, 364542, 364543, 364544, 364545,
       364546, 364547, 364548, 364549, 364550, 364551, 364552, 364553,
       364554, 364555, 364556, 364557, 364558, 364559, 364560, 364561,
       364562, 364563, 364564, 364565, 364566, 364567, 364568, 364569,
       364570, 364571, 364572, 364573, 364574, 364575, 364576, 364577,
       364578, 364579, 364580, 364581, 364582, 364583, 364584, 364585,
       364586, 364587, 364588, 364589, 364590, 364591, 364592, 364593,
       364594, 364595, 364596, 364597, 364598, 364599, 364600, 364601,
       364602, 364603, 364604, 364605, 364606, 364607, 364608, 364609,
       364610, 364611, 364612, 364613, 364614, 364615, 364616, 364617,
       364618, 364619, 364620, 364621, 364622, 364623, 364624, 364625,
       364626, 364627, 364628, 364629, 364630, 364631, 364632, 364633,
       364634, 364635, 364636, 364637, 364638, 364639, 364640, 364641,
       364642, 364643, 364644, 364645, 364646, 364647, 364648, 364649,
       364650, 364651, 364652, 364653, 364654, 364655, 364656, 364657,
       364658, 364659, 364660, 364661, 364662, 364663, 364664, 364665,
       364666, 364667, 364668, 364669, 364670, 364671, 364672, 364673,
       364674, 364675, 364676, 364677, 364678, 364679, 364680, 364681,
       364682, 364683, 364684, 364685, 364686, 364687, 364688, 364689,
       364690, 364691, 364692, 364693, 364694, 364695, 364696, 364697,
       364698, 364699, 364700, 364701, 364702, 364703, 364704, 364705,
       364706, 364707, 364708, 364709, 364710, 364711, 364712, 364713,
       364714, 364715, 364716, 364717, 364718, 364719, 364720, 364721,
       364722, 364723, 364724, 364725, 364726, 364727, 364728, 364729,
       364730, 364731, 364732, 364733, 364734, 364735, 364736, 364737,
       364738, 364739, 364740, 364741, 364742, 364743, 364744, 364745,
       364746, 364747, 364748, 364749, 364750, 364751, 364752, 364753,
       364754, 364755, 364756, 364757, 364758, 364759, 364760, 364761,
       364762, 364763, 364764, 364765, 364766, 364767, 364768, 364769,
       364770, 364771, 364772, 364773, 364774, 364775, 364776, 364777,
       364778, 364779, 364780, 364781, 364782, 364783, 364784, 364785,
       364786, 364787, 364788, 364789, 364790, 364791, 364792, 364793,
       364794, 364795, 364796, 364797, 364798, 364799, 364800, 364801,
       364802, 364803, 364804, 364805, 364806, 364807, 364808, 364809,
       364810, 364811, 364812, 364813, 364814, 364815, 364816, 364817,
       364818, 364819, 364820, 364821, 364822, 364823, 364824, 364825,
       364826, 364827, 364828, 364829, 364830, 364831, 364832, 364833,
       364834, 364835, 364836, 364837, 364838, 364839, 364840, 364841,
       364842, 364843, 364844, 364845, 364846, 364847, 364848, 364849,
       364850, 364851, 364852, 364853, 364854, 364855, 364856, 364857,
       364858, 364859, 364860, 364861, 364862, 364863, 364864, 364865,
       364866, 364867, 364868, 364869, 364870, 364871, 364872, 364873,
       364874, 364875, 364876, 364877, 364878, 364879, 364880, 364881,
       364882, 364883, 364884, 364885, 364886, 364887, 364888, 364889,
       364890, 364891, 364892, 364893, 364894, 364895, 364896, 364897,
       364898, 364899, 364900, 364901, 364902, 364903, 364904, 364905,
       364906, 364907, 364908, 364909, 364910, 364911, 364912, 364913,
       364914, 364915, 364916, 364917, 364918, 364919, 364920, 364921,
       364922, 364923, 364924, 364925, 364926, 364927, 364928, 364929,
       364930, 364931, 364932, 364933, 364934, 364935, 364936, 364937,
       364938, 364939, 364940, 364941, 364942, 364943, 364944, 364945,
       364946, 364947, 364948, 364949, 364950, 364951, 364952, 364953,
       364954, 364955, 364956, 364957, 364958, 364959, 364960, 364961,
       364962, 364963, 364964, 365433, 365434, 365435, 365436, 365437,
       365438, 365439, 365440, 365441, 365442, 365443, 365444, 365445,
       365446, 365447, 365448, 365449, 365450, 365451, 365452, 365453,
       365454, 365455, 365456, 365457, 365458, 365459, 365460, 365461,
       365462, 365463, 365464, 365465, 365466, 365467, 365468, 365469,
       365470, 365471, 365472, 365473, 365474, 365475, 365476, 365477,
       365478, 365479, 365480, 365481, 365482, 365483, 365484, 365485,
       365486, 365487, 365488, 365489, 365490, 365491, 365492, 365493,
       365494, 365495, 365496, 365497, 365498, 365499, 365500, 365501,
       365502, 365503, 365504, 365505, 365506, 365507, 365508, 365509,
       365510, 365511, 365512, 365513, 365514, 365515, 365516, 365517,
       365518, 365519, 365520, 365521, 365522, 365523, 365524, 365525,
       365526, 365986, 365987, 365988, 365989, 365990, 365991, 365992,
       365993, 365994, 365995, 365996])
 

Next I want to have the splice site sequences for RPL7A for each intron:

In [22]:
intronnumber = 0
for intron in gtf.intronSequences(gene):
    intronnumber += 1
    print("intron %s:\n5' splice site:\t%s\n3' splice site:\t%s\n" % \
          (intronnumber,intron[:5],intron[-5:]))
 
intron 1:
5' splice site:	GTATG
3' splice site:	TTTAG

intron 2:
5' splice site:	GTATG
3' splice site:	TACAG

 

What are the UTR coordinates for RPL7A?

In [23]:
utrcoordinates = gtf.utrCoordinates(gene)
fiveutr  = utrcoordinates[0]
threeutr = utrcoordinates[1]
print("5'UTR:\t%s\n3'UTR:\t%s" % (fiveutr,threeutr))
 
5'UTR:	(365997, 366110)
3'UTR:	(364167, 364337)
 

OK, now I would like to have the genomic sequence for RPL7A in a readable fasta format:

In [24]:
genomicseq = gtf.genomicSequence(gene,format="fasta",split=True)
print(genomicseq)
 
>RPL7A
CTTTGACAACTTCATCCTTTCTGAAATATTGAAATAATCAAGACACTAGA
TCTGCCTTGTATTGTCTTCGGTTCATATAAATCCAAATAACCAACCAAGC
AAATTAAGATCACAATGGCCGCTGAGTATGTATACGAAATCGTCTATTTT
ATTACTATTTCCTAAAGTATCCAGAGAAGTATACTAGTTTCCGCAAGTTA
CGTTGAGAAAAAGGAGCAATTCTGCTCTAATAGCATTTGCACTCACACTT
ACATTTATCTTCAACTTGTGACTATTCACAAGATGATTTTCCGTAGGGGA
TTACCTTCACAGGGTCAAAGACAATTTTAGTTGATGAATTTGAGGAAAAA
TTTGAAATATCTTGATTGTTGCCAAAAATTTCATCGCCGACCTTGATTAG
AATATTTTCTCAAATTTTCATTTCCTAAGTGTCTCTGTTTCACTCCCTTC
ATTCGAATGAAAAGTGGTGTTTTCAATTCGATGTAACTTCGTAGAAAAAA
GTTTTACTTGATGGTCATACAAACCTAATTCATTCTATTTTCGGTTATGC
TAACAAAAATATTTTATTATTTGTATTATTACAGAAAAATCTTGACCCCA
GAATCTCAGTTGAAGAAGTCTAAGGCTCAACAAAAGACTGCTGAACAAGT
CGCTGCTGAAAGAGCTGCTCGTAAGGCTGTATGTTAAACTTTTGCTTACA
TTTATATTTTATGCTGCAACACATTTTCTTACTGGGATATATGTGAGTTC
ACGTGCGTCGCCAATTTTGATAAGTATCATTCGAACAGTTGCAATATTCT
CAACACCTGTGGTATCATTAGGCTAATGGACTATTTGTTACGTTCTGTTT
GTTTTCTCATTTTATGATGAAAATCCATTGTATCCCAATACATATTCGAC
TAGTCATTGACGATGCTGTCGTAACTTATCACCATCTTTCGACTGATTAT
AAGAAAAGAAATAGAAAGTAAAATAGACCAGAGAGTTAAGGTATTAAGGT
TACAATTATTGCCGTGTTTCCAGTGTCAGAAAAACCATTGAGTGATAATG
CCGTGTAACAATTTTTAGCATAAAATACTGTGTTTTCCATCTGGTATTTT
TTTACTAACATCAATTTACTGTTTTTTACTTTTTATTTTCATTTAGGCTA
ACAAGGAAAAGAGAGCCATTATTTTGGAAAGAAACGCCGCTTACCAAAAG
GAATACGAAACTGCTGAAAGAAACATCATTCAAGCTAAGCGTGATGCCAA
GGCTGCTGGTTCCTACTACGTCGAAGCTCAACACAAGTTGGTCTTCGTTG
TCAGAATCAAGGGTATTAACAAGATCCCACCTAAGCCAAGAAAGGTTCTA
CAATTGCTAAGATTGACAAGAATCAACTCTGGTACATTCGTCAAAGTTAC
CAAGGCTACTTTGGAACTATTGAAGTTGATTGAACCATACGTTGCTTACG
GTTACCCATCGTACTCTACTATTAGACAATTGGTCTACAAGAGAGGTTTC
GGTAAGATCAACAAGCAAAGAGTTCCATTGTCCGACAATGCTATCATCGA
AGCCAACTTGGGTAAGTATGGTATCTTGTCCATTGACGATTTGATTCACG
AAATCATCACTGTTGGTCCACACTTCAAGCAAGCTAACAACTTTTTGTGG
CCATTCAAGTTGTCCAACCCATCTGGTGGTTGGGGTGTCCCAAGAAAGTT
CAAGCACTTTATCCAAGGTGGTTCTTTCGGTAACCGTGAAGAATTCATCA
ACAAATTGGTTAAGTCCATGAACTAAACTGGTTGAAGTCACATTACATAT
AACTACTTATTTTATCTAAGTCCTATATAGTTCTGTTATTCACTAGTTTG
TATATTGTACCTCTAATTTATCTTATAATTTCGTCAGTAATGACTTATAA
TTGGCTTTATTTTCTTCTCTTTTTTTTCACATGCGCAGAACCTA
 

What if I just want the coding sequence?

 

Below the coding sequence from SGD:

In [25]:
sgdseq = "ATGGCCGCTGAAAAAATCTTGACCCCAGAATCTCAGTTGAAGAAGT\
CTAAGGCTCAACAAAAGACTGCTGAACAAGTCGCTGCTGAAA\
GAGCTGCTCGTAAGGCTGCTAACAAGGAAAAGAGAGCCATTA\
TTTTGGAAAGAAACGCCGCTTACCAAAAGGAATACGAAACTG\
CTGAAAGAAACATCATTCAAGCTAAGCGTGATGCCAAGGCTG\
CTGGTTCCTACTACGTCGAAGCTCAACACAAGTTGGTCTTCG\
TTGTCAGAATCAAGGGTATTAACAAGATCCCACCTAAGCCAA\
GAAAGGTTCTACAATTGCTAAGATTGACAAGAATCAACTCTG\
GTACATTCGTCAAAGTTACCAAGGCTACTTTGGAACTATTGA\
AGTTGATTGAACCATACGTTGCTTACGGTTACCCATCGTACT\
CTACTATTAGACAATTGGTCTACAAGAGAGGTTTCGGTAAGA\
TCAACAAGCAAAGAGTTCCATTGTCCGACAATGCTATCATCG\
AAGCCAACTTGGGTAAGTATGGTATCTTGTCCATTGACGATT\
TGATTCACGAAATCATCACTGTTGGTCCACACTTCAAGCAAG\
CTAACAACTTTTTGTGGCCATTCAAGTTGTCCAACCCATCTG\
GTGGTTGGGGTGTCCCAAGAAAGTTCAAGCACTTTATCCAAG\
GTGGTTCTTTCGGTAACCGTGAAGAATTCATCAACAAATTGG\
TTAAGTCCATGAAC"
In [26]:
gtf.codingSequence(gene)
Out[26]:
'ATGGCCGCTGAAAAAATCTTGACCCCAGAATCTCAGTTGAAGAAGTCTAAGGCTCAACAAAAGACTGCTGAACAAGTCGCTGCTGAAAGAGCTGCTCGTAAGGCTGCTAACAAGGAAAAGAGAGCCATTATTTTGGAAAGAAACGCCGCTTACCAAAAGGAATACGAAACTGCTGAAAGAAACATCATTCAAGCTAAGCGTGATGCCAAGGCTGCTGGTTCCTACTACGTCGAAGCTCAACACAAGTTGGTCTTCGTTGTCAGAATCAAGGGTATTAACAAGATCCCACCTAAGCCAAGAAAGGTTCTACAATTGCTAAGATTGACAAGAATCAACTCTGGTACATTCGTCAAAGTTACCAAGGCTACTTTGGAACTATTGAAGTTGATTGAACCATACGTTGCTTACGGTTACCCATCGTACTCTACTATTAGACAATTGGTCTACAAGAGAGGTTTCGGTAAGATCAACAAGCAAAGAGTTCCATTGTCCGACAATGCTATCATCGAAGCCAACTTGGGTAAGTATGGTATCTTGTCCATTGACGATTTGATTCACGAAATCATCACTGTTGGTCCACACTTCAAGCAAGCTAACAACTTTTTGTGGCCATTCAAGTTGTCCAACCCATCTGGTGGTTGGGGTGTCCCAAGAAAGTTCAAGCACTTTATCCAAGGTGGTTCTTTCGGTAACCGTGAAGAATTCATCAACAAATTGGTTAAGTCCATGAAC'
In [27]:
gtf.codingSequence(gene) == sgdseq
Out[27]:
True
In [28]:
codingseq = gtf.codingSequence(gene,format="fasta",split=True)
print(codingseq)
 
>RPL7A
ATGGCCGCTGAAAAAATCTTGACCCCAGAATCTCAGTTGAAGAAGTCTAA
GGCTCAACAAAAGACTGCTGAACAAGTCGCTGCTGAAAGAGCTGCTCGTA
AGGCTGCTAACAAGGAAAAGAGAGCCATTATTTTGGAAAGAAACGCCGCT
TACCAAAAGGAATACGAAACTGCTGAAAGAAACATCATTCAAGCTAAGCG
TGATGCCAAGGCTGCTGGTTCCTACTACGTCGAAGCTCAACACAAGTTGG
TCTTCGTTGTCAGAATCAAGGGTATTAACAAGATCCCACCTAAGCCAAGA
AAGGTTCTACAATTGCTAAGATTGACAAGAATCAACTCTGGTACATTCGT
CAAAGTTACCAAGGCTACTTTGGAACTATTGAAGTTGATTGAACCATACG
TTGCTTACGGTTACCCATCGTACTCTACTATTAGACAATTGGTCTACAAG
AGAGGTTTCGGTAAGATCAACAAGCAAAGAGTTCCATTGTCCGACAATGC
TATCATCGAAGCCAACTTGGGTAAGTATGGTATCTTGTCCATTGACGATT
TGATTCACGAAATCATCACTGTTGGTCCACACTTCAAGCAAGCTAACAAC
TTTTTGTGGCCATTCAAGTTGTCCAACCCATCTGGTGGTTGGGGTGTCCC
AAGAAAGTTCAAGCACTTTATCCAAGGTGGTTCTTTCGGTAACCGTGAAG
AATTCATCAACAAATTGGTTAAGTCCATGAAC
 

Does the sequence translate properly?

In [29]:
protein = translate(gtf.codingSequence("RPL7A"))
print(protein)
 
MAAEKILTPESQLKKSKAQQKTAEQVAAERAARKAANKEKRAIILERNAAYQKEYETAERNIIQAKRDAKAAGSYYVEAQHKLVFVVRIKGINKIPPKPRKVLQLLRLTRINSGTFVKVTKATLELLKLIEPYVAYGYPSYSTIRQLVYKRGFGKINKQRVPLSDNAIIEANLGKYGILSIDDLIHEIITVGPHFKQANNFLWPFKLSNPSGGWGVPRKFKHFIQGGSFGNREEFINKLVKSMN
 

Now for some more complicated examples. What if I want to calculate overlap between interval coordinates and gtf feature intervals? I know my intervals map to chromosome I so what I need is a list of start and end positions for each gene on chromosome I and also the gene name:

In [30]:
coordinates = gtf.chromosomeGeneCoordIterator("I")
 

Just want to see the first ten coordinates:

In [31]:
coordinates[:10]
Out[31]:
[(335, 649, 'YAL069W'),
 (538, 792, 'YAL068W-A'),
 (1807, 2169, 'PAU8'),
 (2480, 2707, 'YAL067W-A'),
 (5075, 6237, 'SUT432'),
 (5236, 5887, 'XUT_1R-2'),
 (7235, 9016, 'SEO1'),
 (9368, 9600, 'SUT001'),
 (10091, 10399, 'YAL066W'),
 (10732, 11140, 'CUT436')]
 

OK but what if I know that my intervals are only on the '-' strand?

In [32]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-")
coordinates[:10]
Out[32]:
[(1807, 2169, 'PAU8'),
 (5075, 6237, 'SUT432'),
 (5236, 5887, 'XUT_1R-2'),
 (7235, 9016, 'SEO1'),
 (10732, 11140, 'CUT436'),
 (11565, 11951, 'YAL065C'),
 (13363, 13743, 'TDA8'),
 (17191, 17980, 'XUT_1R-5'),
 (22395, 22685, 'YAL063C-A'),
 (24000, 27968, 'FLO9')]
 

But actually I only care about protein coding genes:

In [33]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",annotation="protein_coding")
coordinates[:10]
Out[33]:
[(1807, 2169, 'PAU8'),
 (7235, 9016, 'SEO1'),
 (11565, 11951, 'YAL065C'),
 (13363, 13743, 'TDA8'),
 (22395, 22685, 'YAL063C-A'),
 (24000, 27968, 'FLO9'),
 (36406, 36918, 'YAL059C-A'),
 (38696, 39046, 'YAL056C-A'),
 (42881, 45421, 'ACS1'),
 (51777, 52696, 'AIM2')]
 

To calculate overlap the GTF2 parsee uses numpy arrays:

In [34]:
coordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",annotation="protein_coding",numpy=True)
coordinates[:10]
Out[34]:
array([( 1807,  2169, 'PAU8'), ( 7235,  9016, 'SEO1'),
       (11565, 11951, 'YAL065C'), (13363, 13743, 'TDA8'),
       (22395, 22685, 'YAL063C-A'), (24000, 27968, 'FLO9'),
       (36406, 36918, 'YAL059C-A'), (38696, 39046, 'YAL056C-A'),
       (42881, 45421, 'ACS1'), (51777, 52696, 'AIM2')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U100')])
 

Just to show you how it works an example below. Let's say I have an interval that maps to chrI on the '-' strand and I want to know what features overlap with this interval

In [35]:
intervalstart = 134343
intervalend   = 135904
 

Next I need to import the numpy_overlap method from the Methods file:

In [36]:
from pyCRAC.Methods import numpy_overlap
 

OK let's find some genes that overlap sense with this interval:

In [37]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand="-",numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[37]:
['MDM10']
 

This shows that my interval maps to an intergenic region (INT) and the MDM10 gene). What if I don't care on which strand the feature is located (i.e. also include anti-sense reads)?

In [38]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand=None,numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[38]:
['MDM10', 'SPO7']
 

Right so SPO7 is anti-sense to my gene. If I just wanted to know the anti-sense hits:

In [39]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand="+",numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend)
Out[39]:
['SPO7']
 

But what if I want to have at least 100 nucleotide overlap with the features?

In [40]:
mygenecoordinates = gtf.chromosomeGeneCoordIterator("I",strand=None,numpy=True)
numpy_overlap(mygenecoordinates,intervalstart,intervalend,overlap=100)
Out[40]:
['MDM10']
 

What genes are there on chromosome I (only showing the first 20)?

In [41]:
gtf.chromosomesGenesList("I")[:20]
Out[41]:
['YAL069W',
 'YAL068W-A',
 'PAU8',
 'YAL067W-A',
 'SUT432',
 'XUT_1R-2',
 'SEO1',
 'SUT001',
 'YAL066W',
 'CUT436',
 'XUT_1F-1',
 'YAL065C',
 'YAL064W-B',
 'Unit1',
 'XUT_1F-3',
 'TDA8',
 'XUT_1F-4',
 'XUT_1R-5',
 'YAL064W',
 'YAL063C-A']
 

Actally I just want the protein coding genes:

In [42]:
gtf.chromosomesGenesList("I",annotation="protein_coding")[:20]
Out[42]:
['YAL069W',
 'YAL068W-A',
 'PAU8',
 'YAL067W-A',
 'SEO1',
 'YAL066W',
 'YAL065C',
 'YAL064W-B',
 'TDA8',
 'YAL064W',
 'YAL063C-A',
 'FLO9',
 'GDH3',
 'BDH2',
 'BDH1',
 'YAL059C-A',
 'ECM1',
 'CNE1',
 'YAL056C-A',
 'GPB2']
 

What about snoRNAs (showing all present)?

In [43]:
gtf.chromosomesGenesList("I",annotation="snoRNA")
Out[43]:
['snR18']
 

What about tRNAs? (sbowing all present)?

In [44]:
gtf.chromosomesGenesList("I",annotation="tRNA")
Out[44]:
['TRN1', 'tA(UGC)A', 'SUP56', 'tS(AGA)A']
 

Well I guess you get the idea by now...

 

The geneIterCoordinates class method is very useful for generating nucleotide pileups because it gives you the flexibility to add flanking sequences and/or remove introns. An example:

In [45]:
gtf.chromosomeCoordinates("MDM10")
Out[45]:
(134184, 135693)
In [46]:
nucleotidecoord = gtf.geneIterCoordinates("MDM10")
nucleotidecoord
Out[46]:
array([134184, 134185, 134186, ..., 135691, 135692, 135693])
 

Where does my interval overlap with the MDM10 coordinates? Numpy has this great function called 'where'.

In [47]:
intervalstart = 134143
intervalend   = 135604
np.where((intervalstart <=nucleotidecoord) & (intervalend >=nucleotidecoord))
Out[47]:
(array([   0,    1,    2, ..., 1418, 1419, 1420]),)
 

That gives you the indexes for the positions where there is overlap. But if I want to make a pileup, I have to keep track of where it overlaps:

In [48]:
pileup = np.where((intervalstart <=nucleotidecoord) & (intervalend >=nucleotidecoord),1,0)
pileup
Out[48]:
array([1, 1, 1, ..., 0, 0, 0])
 

We can use this trick to make a pileup for our gene of interest to make a plot for the number of reads that map to our favorite gene

 

Below an example. I want to make a pileup for all the intervals that mapped to RPL7A

In [49]:
nucleotidecoord = gtf.geneIterCoordinates("RPL7A")   # generate an array with chromosomal coordinates for each nucleotide in RPL7A
pileup = np.zeros(len(nucleotidecoord))              # make an array to collect the hits
start = np.random.randint(364249,366023,1000)        # make an array with start coordinates.
end   = start+30                                     # read length = 30 nt.
coordinates = zip(start,end)                         # merge them together                 
#print(list(coordinates))
for x,y in list(coordinates):
    pileup += np.where((x <=nucleotidecoord) & (y >=nucleotidecoord),1,0)
print(pileup)
 
[0. 0. 0. ... 0. 0. 0.]
 

So now I added another overlapping interval to the pileup. To print the pileup, I simply do the following:

In [50]:
genomicsequence = gtf.genomicSequence("RPL7A")
for pos,nuc in enumerate(genomicsequence):
    print("%s\t%s\t%s" % (pos+1,genomicsequence[pos],pileup[pos]))
 
1	C	0.0
2	T	0.0
3	T	0.0
4	T	0.0
5	G	0.0
6	A	0.0
7	C	0.0
8	A	0.0
9	A	0.0
10	C	0.0
11	T	0.0
12	T	0.0
13	C	0.0
14	A	0.0
15	T	0.0
16	C	0.0
17	C	0.0
18	T	0.0
19	T	0.0
20	T	0.0
21	C	0.0
22	T	0.0
23	G	0.0
24	A	0.0
25	A	0.0
26	A	0.0
27	T	0.0
28	A	0.0
29	T	0.0
30	T	0.0
31	G	0.0
32	A	0.0
33	A	0.0
34	A	0.0
35	T	0.0
36	A	0.0
37	A	0.0
38	T	0.0
39	C	0.0
40	A	0.0
41	A	0.0
42	G	0.0
43	A	0.0
44	C	0.0
45	A	0.0
46	C	0.0
47	T	0.0
48	A	0.0
49	G	0.0
50	A	0.0
51	T	0.0
52	C	0.0
53	T	0.0
54	G	0.0
55	C	0.0
56	C	0.0
57	T	0.0
58	T	0.0
59	G	0.0
60	T	0.0
61	A	0.0
62	T	0.0
63	T	0.0
64	G	0.0
65	T	0.0
66	C	0.0
67	T	0.0
68	T	0.0
69	C	0.0
70	G	0.0
71	G	0.0
72	T	0.0
73	T	0.0
74	C	0.0
75	A	0.0
76	T	0.0
77	A	0.0
78	T	0.0
79	A	0.0
80	A	0.0
81	A	0.0
82	T	0.0
83	C	1.0
84	C	1.0
85	A	4.0
86	A	4.0
87	A	4.0
88	T	4.0
89	A	4.0
-----
 

OK but what if I know that my reads only map to spliced RPL7A?

In [51]:
nucleotidecoord = gtf.geneIterCoordinates("RPL7A",coordinates="coding")   # generate an array with chromosomal coordinates for each nucleotide in RPL7A
pileup = np.zeros(len(nucleotidecoord))              # make an array to collect the hits
for start,end in coordinates:
    pileup += np.where((start <=nucleotidecoord) & (end >=nucleotidecoord),1,0)