Extract Sequences From [Gtf File] + [Genome Fasta File]


7.8 years ago by

Karen Tang • 40 wrote:

Hi Galaxy people,

I have transcripts predicted by Cufflinks that are in a gtf file. How can I extract the sequences corresponding to those transcripts, using Galaxy?

[Cufflinks transcript predictions in gtf file] + [Genome sequence in FASTA file] ---> [FASTA file of transcript sequences]

My genome is a custom genome (not at UCSC). I'll also need to do the same thing, except my predicted transcripts are in a Scripture bed file.

Thanks for your help!

Karen Tang :)
Plant Biology
University of Minnesota

rna-seq cufflinks • 9.3k views


modified 7.6 years ago by Edge Edge • 10 • written 7.8 years ago by Karen Tang • 40

7.8 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello Karen,

The following general workflow should help you to pull sequences from any source.

1) Cut out the sequence IDs from the query (in this case, a GTF and a BED file) and sort them.
   Text Manipulation -> Cut columns from a table
   Filter and Sort -> Sort
2) Convert the target FASTA file to tabular format.
   Convert Formats -> FASTA-to-Tabular converter
3) Join the two datasets based on the sequence ID.
   Join, Subtract and Group -> Join two Queries
4) Convert to FASTA.
   Convert Formats -> Tabular-to-FASTA
5) When starting with a GTF file, there will most likely be duplicates. To remove them, use:
   NGS: QC and manipulation -> Collapse sequences

Once you create the actual workflow that performs the job, be sure to save it so that you can re-use it whenever you need to perform the same task. To do this, from the history pane (far right) use Options -> Extract workflow and follow the instructions on the form to customize.

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org
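For readers who want to see what the join in steps 1 to 5 amounts to outside Galaxy, here is a minimal Python sketch. The function names `fasta_to_table` and `pull_sequences` are hypothetical, not Galaxy tools, and duplicates are dropped by ID here, which is a simplification of what 'Collapse sequences' does.

```python
def fasta_to_table(fasta_text):
    """Parse FASTA text into {sequence_id: sequence} (step 2, FASTA-to-tabular)."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def pull_sequences(ids, fasta_text):
    """Join a list of IDs against the FASTA table and write FASTA (steps 1, 3, 4, 5)."""
    table = fasta_to_table(fasta_text)
    wanted = sorted(set(ids))  # sort the IDs and drop duplicate IDs
    # IDs with no matching FASTA record are silently skipped, like an inner join.
    return "".join(f">{i}\n{table[i]}\n" for i in wanted if i in table)


fasta = ">a some description\nACGT\nAC\n>b\nGG\n"
print(pull_sequences(["b", "a", "b", "z"], fasta))
```

Note that this returns whole FASTA records by ID; it does not cut out sub-regions of a sequence.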


I was thinking of something different. Here is an example of a three-exon transcript, in gtf format:

contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";
contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";
contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";
contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";

and the genome sequence that the transcript comes from is:

GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG

I want the sequence for this transcript: I want to extract from the genome sequence the subsequences for positions 3-10, 13-18, and 20-22, and then concatenate the three subsequences to create the transcript sequence. In this case, it would be AGCGTCTC + ACGCGG + TAT, meaning the transcript sequence would be AGCGTCTCACGCGGTAT.

Is it possible to do this in Galaxy?

Karen :)
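The operation described above can be sketched in a few lines of Python. This is only an illustration of the slicing and concatenation, not how any Galaxy tool is implemented; the function name `transcript_sequence` is hypothetical, and the minus-strand handling is an assumption added for completeness (the example transcript is on the + strand).

```python
def transcript_sequence(genome, exons, strand="+"):
    """Concatenate 1-based, inclusive exon intervals taken from one contig."""
    # GTF coordinates are 1-based and inclusive, so exon (3, 10) is genome[2:10].
    pieces = [genome[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    if strand == "-":
        # Assumption: minus-strand transcripts are the reverse complement
        # of the joined exons.
        complement = str.maketrans("ACGTacgt", "TGCAtgca")
        seq = seq.translate(complement)[::-1]
    return seq


genome = "GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG"
exons = [(3, 10), (13, 18), (20, 22)]
print(transcript_sequence(genome, exons))  # AGCGTCTCACGCGGTAT
```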

written 7.8 years ago by Karen Tang • 40

Hi Karen,

I just implemented this functionality in Galaxy's 'Extract Genomic DNA' tool. This functionality will be available on our main server in the next couple of weeks and is available now via our development repository ( bitbucket.org/galaxy/galaxy-central/ )

One note: GTF files produced by Cuff* are unusual in that, for each assembled transcript, they include a "transcript" element in addition to exons. This element is problematic because it spans the entire transcript. Hence, in order to get the sequence data for transcripts in a Cuff* GTF file, you'll want to select for only exons (use Galaxy's 'Extract Features' tool) and then use the resultant dataset as input to Extract.

Let us know if you have any questions.

Thanks,
J.
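To make the exon-only selection concrete, here is a rough Python sketch of the same filtering done outside Galaxy. It assumes tab-delimited GTF lines with the feature type in column 3 and a `transcript_id "..."` attribute in column 9; the function name `exons_by_transcript` is hypothetical and this is not the code behind 'Extract Features'.

```python
import re
from collections import defaultdict


def exons_by_transcript(gtf_lines):
    """Group exon intervals by transcript_id, skipping 'transcript' records."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only exon features; the Cuff* 'transcript' element spans
        # the whole transcript, introns included.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )
    return dict(exons)


attrs = 'gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";'
gtf = [
    "contig00035\tCufflinks\ttranscript\t3\t22\t1000\t+\t.\t" + attrs,
    "contig00035\tCufflinks\texon\t3\t10\t1000\t+\t.\t" + attrs,
    "contig00035\tCufflinks\texon\t13\t18\t1000\t+\t.\t" + attrs,
    "contig00035\tCufflinks\texon\t20\t22\t1000\t+\t.\t" + attrs,
]
print(exons_by_transcript(gtf))
```

Each value is a list of (contig, start, end, strand) tuples, with coordinates left 1-based and inclusive as in the GTF.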

written 7.8 years ago by Jeremy Goecks • 2.2k

Edge,

Please send questions like this to the galaxy-user mailing list, where many people see your email and can help you and/or benefit from it. I've cc'd the list for this reply.

The thread you linked to is out of date. To get sequences for the features in a GTF file, you can use the 'Extract Genomic DNA' tool and set the option 'Interpret features when possible' to Yes. To get sequences for Cufflinks transcripts, use the transcripts.gtf as input to the tool.

Best,
J.

written 7.6 years ago by Jeremy Goecks • 2.2k

7.8 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Subject: Re: [galaxy-user] Extract sequences from [gtf file] + [genome FASTA file]
Date: Thu, 27 Jan 2011 17:23:11 -0700
To: Jennifer Jackson <jen@bx.psu.edu>

Dear Jen,

I am not much of a Galaxy user yet, but a long-time user of GenBank and other databases and sequence analysis tools (phylogenetics software, etc.). A common task I would like to do is obtain a FASTA format file (ideally aligned, but I can do the alignment later very easily) of the regions of sequences hit in a BLAST search on GenBank.

It is easy to ask GenBank to give me all (or the selected few) sequences hit in the BLAST search, but not so easy to get each sequence "clipped" to the matched region. For example, if I search with the D-loop region of a mammal mitochondrial genome, I would like to get that region clipped out of all the hundreds of complete mitochondrial genomes. Or if I search with a mammalian endogenous retrovirus, get the retroviruses clipped from the complete chromosome entries.

Ideally, I would add one more criterion: I would like to be able to get some number of bases (let's say 100) flanking the matched region. So I could capture the integration sites of endogenous retroviruses, for example. Or get the intron flanks of a gene if I was searching with a mammalian gene exon.

The final thing would be to deal with the fact that GenBank BLAST match results often get fragmented. For example, the LTRs of retroviruses (endogenous or not) create a problem. And any large in/dels or highly variable regions often split one contiguous homologous string into two individual matches split at the in/del or variable site.

This looks somewhat similar to the task you describe below, so I am wondering if it is something I can do in Galaxy (or with Galaxy plus a few other tools).

GenBank/BLAST will almost give me what I want. The trouble I find is that either I can get the result as a multiple sequence alignment but with useless sequence names (just the gi number for identifier) and not in FASTA format, or I can get full sequence entries but not the matched region clipped out. I have asked NCBI/GenBank if they would serve up the results in FASTA format, but they are not responsive on that.

Brian Foley PhD
HIV Databases
btf@lanl.gov
http://www.hiv.lanl.gov
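The clipping, flanking, and fragment-merging described in this message can be sketched generically in Python. This is not a Galaxy or NCBI feature; the function names `merge_hits` and `clip_with_flank` and the `max_gap` and `flank` parameters are hypothetical, coordinates are assumed to be 1-based and inclusive on the subject sequence, and strand is ignored.

```python
def merge_hits(hits, max_gap=0):
    """Merge (start, end) hits that overlap or lie within max_gap bases of each other."""
    merged = []
    for start, end in sorted(hits):
        # Number of bases between the previous hit and this one (negative if overlapping).
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip_with_flank(sequence, start, end, flank=100):
    """Return the matched region plus up to `flank` bases on each side."""
    # Clamp to the sequence ends so a hit near an edge gets a shorter flank.
    lo = max(1, start - flank)
    hi = min(len(sequence), end + flank)
    return sequence[lo - 1:hi]


hits = [(10, 20), (25, 40), (100, 120)]
print(merge_hits(hits, max_gap=10))               # [(10, 40), (100, 120)]
print(clip_with_flank("ACGTACGTAC", 4, 6, flank=2))  # CGTACGT
```

Choosing `max_gap` is a judgment call: it needs to be large enough to bridge an in/del or variable region, but small enough not to join unrelated hits.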


7.8 years ago by

Jeremy Goecks • 2.2k

Jeremy Goecks • 2.2k wrote:

Dinesh, Please send questions such as these to the galaxy-user mailing list (which I've cc'd). You can find the 'Extract Genomic DNA' tool under the 'Fetch Sequences' menu. You may also want to use tool search ('Options -> Show Tool Search'). Thanks, J.


7.6 years ago by

Edge Edge • 10

Edge Edge • 10 wrote:

Hi,

I just read through the post at the following link:
http://lists.bx.psu.edu/pipermail/galaxy-user/2011-February/001934.html

I'm facing the same problem. I would like to extract the transcripts assembled by Cufflinks. How do I use my output files from TopHat and Cufflinks with Galaxy? I have the following output files right now:

junctions.bed
insertions.bed
deletions.bed
accepted_hits.bam
human_reference_genome.fasta
transcripts.gtf
isoforms.fpkm_tracking
genes.fpkm_tracking

I am a bit confused by the explanation below:

"in order to get the sequence data for transcripts in a Cuff* GTF file, you'll want to select for only exons (use Galaxy's 'Extract Features' tool) and then use the resultant dataset as input to Extract."

Thanks a lot for your advice.

Best regards,
edge

