Question: ChIP-Seq Data Analysis Question
cjt5@buffalo.edu • 10 wrote:
Hello,
My name is Christopher Terranova and I am an M.S. student at the
University of Buffalo, SUNY. I have been analyzing my MACS data using
Galaxy, already have my custom peaks on the UCSC Genome Browser, and
have some specific questions.
I am trying to show how my peaks (and peak center coordinates) relate
to gene units (+/- TSS and genic) and to intergenic regions
specifically. I have been attempting this in two different ways and am
not sure whether I am doing it correctly. Below I list the steps I have
been using, with my particular questions highlighted next to each
problem. I apologize for the length of this e-mail; I have only been
working with Galaxy for approximately a month, and figuring out all the
manipulations is difficult. If someone can answer my questions I would
greatly appreciate it!
These questions relate specifically to promoters.
Retrieving TSS coordinates
1. Go to the UCSC Genome Browser, click "Tables" at the top of the
page, and select mouse mm9 as the organism.
2. Select "RefSeq genes" in tracks, BED as the "output format", and
check "Send output to Galaxy".
3. Click "Get output" then "Send output to Galaxy"; you are redirected
to your Galaxy account, which now contains an additional dataset.
4. Use the Galaxy "Filter" tool (left column) to select all "+" strand
genes.
5. Use the "Cut" tool (left column) to extract columns 1,2,2,4,5,6
(**is the c2 column repeated twice??**) in order to build a BED file
containing the TSS for all "+" strand genes.
6. Do the same for the genes on the "-" strand.
Computing peak center coordinates
1. In Galaxy, select the tool "Compute expression on every row" in the
left column (Text manipulation section).
2. As the expression, enter c2+(c3-c2+1)/2 and set round result to
"YES".
3. Select the dataset containing the peaks for one of the TFs (HNF4a or
CBPA) and click "Execute"; this creates a new dataset with an
additional column containing the coordinate of the peak center.
4. Now select the tool "Cut" and extract the columns c1,c6,c6,c4,c5
(**is the c6 column repeated twice??**) to create a new BED file
containing the peak center.
5. Edit the metadata of this new dataset (by clicking on the small
pencil icon) and change the format to BED.
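The peak center steps above can be sketched in Python as follows. This is a rough illustration only. It assumes a 5-column peaks file (chrom, start, end, name, score), so that the computed center would be the sixth column in Galaxy. The function name `peak_center_record` and the example peak are hypothetical.

```python
def peak_center_record(fields):
    """Return a BED-like record whose start and end are both the peak center."""
    chrom, start, end, name, score = fields[:5]
    start, end = int(start), int(end)
    # Same expression as in step 2: c2+(c3-c2+1)/2, rounded to an integer.
    center = round(start + (end - start + 1) / 2)
    # Repeating the center mirrors cutting columns c1,c6,c6,c4,c5.
    return [chrom, str(center), str(center), name, score]


if __name__ == "__main__":
    # Hypothetical peak, for illustration only.
    peak = ["chr1", "100", "201", "MACS_peak_1", "55.2"]
    print("\t".join(peak_center_record(peak)))
```

Repeating the center column gives the record a start and an end, which the BED format requires; here both are the same coordinate.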
Computing distance to closest TSS
1. Select the tool "Fetch closest non-overlapping feature", then select
the new dataset containing the peak center coordinates and the dataset
containing the mouse TSS. A new dataset is created containing, for each
peak, the closest TSS.
2. Compute the distance from the peak center to the closest TSS using
the "Compute expression on every row" tool (**what expression should I
use to do this?**).
3. Plot the distribution using the "Histogram of a numeric column"
tool.
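To make the intended quantity in step 2 concrete, here is a small Python sketch of the distance from a peak center to the nearest TSS on the same chromosome. It does not reproduce the Galaxy tool, and the column positions in the joined Galaxy dataset will depend on the inputs. It assumes a non-empty, sorted list of TSS positions; the function name `nearest_tss_distance` and the numbers are invented for illustration.

```python
import bisect


def nearest_tss_distance(center, tss_positions):
    """Return center minus the nearest TSS (negative if the TSS lies to the right).

    Assumption: tss_positions is a non-empty, sorted list of integer
    coordinates on the same chromosome as the peak center.
    """
    i = bisect.bisect_left(tss_positions, center)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tss_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda tss: abs(center - tss))
    return center - nearest


if __name__ == "__main__":
    tss = [100, 500, 900]  # hypothetical TSS coordinates
    for center in (450, 950, 50):
        print(center, nearest_tss_distance(center, tss))
```

With one peak and its closest TSS already on the same row, the same quantity is simply the peak center column minus the TSS column; wrapping it in abs() gives an unsigned distance for the histogram.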
Second approach: I understand that this does not identify the peak
center closest to the TSS or take strand into account; however, I still
have a couple of questions.
Now we have a data set corresponding to all human RefSeqs (34,765) and
we want to convert this set into one corresponding to human promoter
regions. First, we will make sure our data set contains just the start
and end coordinates of the genes. Select the "Text Manipulation" tool
and then "Cut" columns from a table. Set "cut columns" to
"c1,c2,c3,c4,c6" (**Is this the right c1... conformation??**). Make
sure our previously downloaded RefSeq data set is selected and click on
"Execute". When this is finished, click on the pencil icon to assign
names to the columns. Set name to "RefSeqs", click "save", then change
the data type to "interval" and click "save". Now click the pencil icon
again to define the columns. Set the start column to "2", the end
column to "3", the strand column to "5" and the "Name/Identifier"
column to "4", and click on "save".
Now, go to the "Operate on Genomic Intervals" section of the "Tools"
menu and select "Get flanks" to get the flanking regions for the RefSeq
data set we just created. Make sure our RefSeq data set is selected and
choose the "upstream" flanking regions for this data set. Set the
length of the flanking region to 1000 to get the coordinates for 1kb
upstream. Later on we could use different intervals. Click on
"Execute". When this has finished, go to "Operate on Genomic Intervals"
again and select "Join". Now set "First query" to "Get flanks.." and
"Second query" to the peaks file of the "MACS" output, and then click
on "Execute". We now end up with 710 regions where our ChIP-Seq peaks
overlap with our 1kb upstream region (promoter region).
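The idea behind the "Get flanks" and "Join" steps can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the Galaxy implementation. It assumes half-open interval coordinates, that "upstream" means before the start for "+" genes and after the end for "-" genes, and that any overlap of at least one base counts as a match. The function names `upstream_flank` and `overlaps` are hypothetical.

```python
def upstream_flank(start, end, strand, length=1000):
    """Return the (start, end) of the region `length` bases upstream of a gene."""
    if strand == "+":
        # Upstream of a "+" gene lies before its start; clamp at coordinate 0.
        return (max(0, start - length), start)
    # Assumption: upstream of a "-" gene lies after its end coordinate.
    return (end, end + length)


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    promoter = upstream_flank(5000, 8000, "+")  # hypothetical gene
    peak = (4900, 5100)                         # hypothetical MACS peak
    print(promoter, overlaps(promoter, peak))
```

Changing `length` corresponds to trying different promoter sizes, as mentioned above for later use of different intervals.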
Lastly, although it is not discussed above, what exactly does the
offset option do when getting flanks?
Thank you very much, and again, I apologize for the extensive
questions!
Sincerely,
Christopher Terranova