Extract Alignment For A Set Of Genes

Question: Extract Alignment For A Set Of Genes

7.5 years ago by

To Whom It May Concern,

Sorry to bother you with what is likely a fairly simple problem, but I have been trying to figure this out myself for several days and just can't work out how to do it.

I have a set of 8766 genes that I would like to test for positive selection using various other programs (HyPhy, for example). To do this I obviously need an alignment of these genes across various species, but I can't figure out how to get the alignment in FASTA format.

For example, I have a BED12 file from UCSC with the data for the 8766 genes. I thought the easiest way was to use the "Stitch Gene blocks" option and then select locally cached alignments as the MAF source for the species I care about. However, because these 8766 genes have multiple transcripts, I end up with 23,581 regions. Is there a way to merge the multiple regions for each gene into a single region for the longest transcript? Then I should have 8766 regions and can use "Stitch Gene blocks". (Unless there is a more economical way to do this.)

Thanks,
Vinny

Vincent J. Lynch, Associate Research Scientist
Department of Ecology and Evolutionary Biology & Yale Systems Biology Institute
Yale University
http://pantheon.yale.edu/~vjl4/profpage/

"There is a grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that whilst this planet has gone on cycling according to the fixed laws of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved." -C. Darwin, 1859

(Walker, Wisconsin, Madison, Maddow, Tea Party, Obama, global warming)



modified 7.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 7.5 years ago by Vincent Joseph Lynch • 40

7.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Vinny,

One option is to filter for a single representative transcript in your BED file from UCSC as a first step, or to use that sort of list as a filter for your final result (if the data is still labeled by transcript IDs). If you are using the "UCSC Genes" track, the table is called "knownCanonical".

Another option is to look at the tools in "Operate on Genomic Intervals" and see if any meet your criteria:
https://bitbucket.org/galaxy/galaxy-central/wiki/GopsDesc

Merge or Cluster may be what you want. Note: this can result in gene models that are not represented by a single transcript in the primary query species.

If you have more questions, please let us know, and kindly keep the cc to galaxy-user so that the Galaxy team and community can offer input.

Best,
Jen
Galaxy team

--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org
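For anyone who would rather do the "one representative transcript per gene" filtering outside Galaxy, here is a minimal Python sketch. It is not a Galaxy tool or a UCSC API. It assumes you already have a transcript-to-gene mapping (BED12 itself only carries the transcript name in column 4, so the mapping has to come from another table), and it takes "longest" to mean the largest summed exon block size from column 11. The function names are made up for illustration.

```python
def transcript_length(fields):
    """Spliced length: sum of the exon block sizes in BED12 column 11."""
    return sum(int(size) for size in fields[10].split(",") if size)


def longest_transcript_per_gene(bed_lines, transcript_to_gene):
    """Keep one BED12 line per gene: the transcript with the largest exon sum.

    bed_lines: iterable of tab-separated BED12 lines.
    transcript_to_gene: dict mapping the BED name column (transcript ID)
        to a gene ID. This mapping is an assumption of the sketch and
        must be supplied by the caller.
    Transcripts missing from the mapping are skipped; ties keep the
    transcript seen first.
    """
    best = {}  # gene ID -> (length, original line)
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        gene = transcript_to_gene.get(fields[3])
        if gene is None:
            continue
        length = transcript_length(fields)
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, line)
    return [line for _, line in best.values()]
```

The returned lines can be written back out as a BED12 file and uploaded to Galaxy as the input for "Stitch Gene blocks". If you would rather rank by genomic span than by spliced length, replace `transcript_length` with `int(fields[2]) - int(fields[1])`.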

written 7.5 years ago by Jennifer Hillman Jackson ♦ 25k
