
Question: ChIP-Seq Data Analysis Question


6.5 years ago by

Hello,

My name is Christopher Terranova and I am an M.S. student at the University of Buffalo, SUNY. I have been analyzing my MACS data using Galaxy. I already have my custom peaks on the UCSC Genome Browser and have some specific questions. I am trying to show how my peaks (and peak center coordinates) relate to gene units (+/- TSS and genic) and to intergenic regions specifically. I have tried two different approaches and am not sure whether I am doing this correctly. Below I list the steps I have been using, with my specific questions highlighted next to the relevant step. I apologize for the length of this e-mail; I have only been working with Galaxy for about a month, and figuring out all the manipulations is difficult. If someone can answer my questions I would greatly appreciate it!

These questions relate specifically to promoters.

Retrieving TSS coordinates

1. Go to the UCSC Genome Browser, click "Tables" at the top of the page, and select mouse mm9 as the organism.
2. Select "RefSeq genes" in tracks, BED as the "output format", and check "Send output to galaxy".
3. Click "Get output", then "Send output to galaxy"; you are redirected to your Galaxy account, which now contains an additional dataset.
4. Use the Galaxy "Filter" tool (left column) to select all "+" strand genes.
5. Use the "Cut" tool (left column) to extract columns 1,2,2,4,5,6 (**is the c2 column repeated twice??**) in order to build a BED file containing the TSS for all "+" strand genes.
6. Do the same for the genes on the "-" strand.

Computing peak center coordinates

1. In Galaxy, select the tool "Compute expression on every row" in the left column (Text manipulation section).
2. As the expression, enter c2+(c3-c2+1)/2, with round result set to "YES".
3. Select the dataset containing the peaks for one of the TFs (HNF4a or CBPA) and click "Execute"; this creates a new dataset with an additional column containing the coordinate of the peak center.
4. Now select the tool "Cut" and extract the columns c1,c6,c6,c4,c5 (**is the c6 column repeated twice??**) to create a new BED file containing the peak center.
5. Edit the metadata of this new dataset (by clicking on the small pencil icon) and change the format to BED.

Computing distance to closest TSS

1. Select the tool "Fetch closest non-overlapping feature", then select the new dataset containing the peak center coordinates and the dataset containing the mouse TSS. A new dataset is created containing, for each peak, the closest TSS.
2. Compute the distance from the peak center to the closest TSS using the "Compute expression on every row" tool (**what expression should I use to do this**).
3. Plot the distribution using the "Histogram of a numeric column" tool.

Second approach: I understand that this does not identify the peak center closest to the TSS, or a particular strand; however, I still have a couple of questions.

Now we have a data set corresponding to all human RefSeqs (34,765) and we want to convert this set into one corresponding to human promoter regions. First, we will make sure our data set contains just the start and end coordinates of the genes. Select the "Text Manipulation" tool and then "Cut" columns from a table. Set "cut columns" to "c1,c2,c3,c4,c6" (**Is this the right c1... conformation??**). Make sure our previously downloaded RefSeq data set is selected and click on "Execute". When this is finished, click on the pencil icon to assign names to the columns. Set name to "RefSeqs", click "save", then change the data type to "interval" and click "save". Now click the pencil icon again to define the columns. Set the start column to "2", the end column to "3", the strand column to "5" and the "Name/Identifier" column to "4", and click on "save".

Now go to the "Operate on Genomic Intervals" section of the "Tools" menu and select "Get flanks" to get the flanking regions for the RefSeq data set we just created. Make sure our RefSeq data set is selected; we want to get the "upstream" flanking regions for this data set. Set the length of the flanking region to 1000 to get the coordinates for 1kb upstream. Later on we could use different intervals. Click on "Execute". When this has finished, go to "Operate on Genomic Intervals" again and select "Join". Set "First query" to "Get flanks.." and "Second query" to the peaks file of the "MACS" output, then click on "Execute". We now end up with 710 regions where our ChIP-Seq peaks overlap with our 1kb upstream region (promoter region).

Lastly, while not discussed here, what exactly does the offset option do when getting flanks?

Thank you very much and again, I apologize for the extensive questions!

Sincerely,
Christopher Terranova
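For reference, the coordinate arithmetic behind the steps above can be sketched in plain Python, outside Galaxy. This is only an illustration under stated assumptions, not a description of what the Galaxy tools do internally: intervals are assumed to be 0-based, half-open BED coordinates; the TSS of a "-" strand gene is taken to be the last base of the interval; the peak center is the integer midpoint (which may differ by a base from the rounded c2+(c3-c2+1)/2 expression); and all function names are hypothetical.

```python
from bisect import bisect_left


def tss_of(start, end, strand):
    """TSS of a BED interval [start, end): the start for '+', the last base for '-'."""
    return start if strand == "+" else end - 1


def peak_center(start, end):
    """Integer midpoint of a BED interval [start, end)."""
    return start + (end - start) // 2


def signed_distance_to_nearest_tss(center, sorted_tss):
    """Distance from a peak center to the nearest TSS on the same chromosome.

    sorted_tss is a non-empty list of (position, strand) tuples sorted by
    position. The sign is strand-aware (an assumption of this sketch):
    negative means the peak lies upstream of the TSS, positive downstream.
    """
    positions = [pos for pos, _ in sorted_tss]
    i = bisect_left(positions, center)
    # Only the TSS just left and just right of the center can be nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    pos, strand = min(candidates, key=lambda t: abs(center - t[0]))
    offset = center - pos
    return offset if strand == "+" else -offset


def upstream_flank(start, end, strand, length=1000):
    """Upstream region of a gene: before the start for '+', after the end for '-'."""
    if strand == "+":
        return max(start - length, 0), start
    return end, end + length


# Hypothetical example data: (chrom, start, end, name, score, strand)
genes = [
    ("chr1", 1000, 5000, "geneA", 0, "+"),
    ("chr1", 8000, 9000, "geneB", 0, "-"),
]
peaks = [("chr1", 700, 900), ("chr1", 9100, 9300)]

tss_list = sorted((tss_of(s, e, strand), strand) for _, s, e, _, _, strand in genes)
distances = [
    signed_distance_to_nearest_tss(peak_center(s, e), tss_list) for _, s, e in peaks
]
flanks = [upstream_flank(s, e, strand) for _, s, e, _, _, strand in genes]
```

A real analysis would group the TSS list by chromosome before searching; the single-chromosome list here just keeps the sketch short. The resulting `distances` list is what would be fed to a histogram.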

macs • 1.4k views


written 6.5 years ago by cjt5@buffalo.edu • 10
