Hi Galaxy people,
I have transcripts predicted by Cufflinks that are in a gtf
file. How can I extract the sequences corresponding to those
transcripts, using Galaxy?
[Cufflinks transcript predictions in gtf file] + [Genome
sequence in FASTA file] ---> [FASTA file of transcript sequences]
My genome is a custom genome (not at UCSC).
I'll also need to do the same thing, except my predicted
transcripts are in a Scripture bed file.
Thanks for your help!
Karen Tang :)
University of Minnesota
The following general workflow should help you to pull sequences from
1) cut out the sequence IDs from the query (in this case, a GTF & BED
file) and sort them.
Text Manipulation -> Cut columns from a table
Filter and Sort -> Sort
2) convert the target fasta file to tabular format
Convert Formats -> FASTA-to-Tabular converter
3) join the two datasets based on the sequence ID
Join, Subtract and Group -> Join two Queries
4) convert to FASTA
Convert Formats -> Tabular-to-FASTA
5) when starting with a GTF file, there will most likely be duplicates.
To remove them, use:
NGS: QC and manipulation -> Collapse sequences
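For readers who want to see the logic of steps 1-5 outside Galaxy, here is a minimal Python sketch. It is not how the Galaxy tools are implemented. It assumes the sequence IDs sit in one tab-separated column of the query file and match the first word of each FASTA header. The function names (read_fasta, cut_ids, pull_sequences) are made up for illustration. Duplicate IDs are dropped with a set before the join, which only roughly stands in for step 5.

```python
def read_fasta(lines):
    """Step 2 (FASTA-to-Tabular): yield (id, sequence) pairs from FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first word of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def cut_ids(table_lines, column=0):
    """Step 1 (Cut + Sort): return the sorted, unique values of one column."""
    ids = set()
    for line in table_lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[column])
    return sorted(ids)


def pull_sequences(table_lines, fasta_lines, column=0):
    """Steps 3-4 (Join + Tabular-to-FASTA): return FASTA text for matching IDs."""
    wanted = cut_ids(table_lines, column)
    records = dict(read_fasta(fasta_lines))
    out = []
    for seq_id in wanted:
        if seq_id in records:  # IDs without a FASTA record are skipped
            out.append(f">{seq_id}\n{records[seq_id]}")
    return "\n".join(out)
```

With real files you would pass open file handles, for example pull_sequences(open("query.gtf"), open("target.fa")), where both file names are placeholders.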
Once you create the actual workflow that performs the job, be sure to
save it so that you can just re-use it whenever you need to perform the
same task. To do this, from the history pane (far right) use Options ->
Extract workflow and follow the instructions on the form.
Hopefully this helps,
Subject: Re: [galaxy-user] Extract sequences from [gtf file] + [genome
Date: Thu, 27 Jan 2011 17:23:11 -0700
To: Jennifer Jackson <email@example.com>
I am not much of a Galaxy user yet, but a long time user of
other databases and sequence analysis tools (Phylogenetics software,
A common task I would like to do, is obtain a FASTA format file
aligned, but I can do the alignment later very easily) of the regions
sequences hit in a BLAST search on GenBank.
It is easy to ask GenBank to give me all (or the selected few) sequences
hit in the BLAST search, but not so easy to get each sequence clipped to
the matched region. For example, if I search with the D-loop region of a
mammal mitochondrial genome, I would like to get that region clipped from
all the hundreds of complete mitochondrial genomes. Or if I search with a
mammalian endogenous retrovirus, get the retroviruses clipped from the
complete chromosome entries.
Ideally, I would add one more criterion: I would like
to be able to get some number of bases (let's say 100) flanking the
region. So I could capture the integration sites of endogenous
retroviruses, for example. Or get the intron flanks of a gene if I am
searching with a mammalian gene exon.
The final thing would be to deal with the fact that GenBank BLAST
results often get fragmented. For example the LTRs of retroviruses
(endogenous or not) create a problem. And any large in/dels or highly
variable regions often split one contiguous homologous string into two
individual matches split at the in/del or variable site.
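The clipping, flanking, and fragment-merging ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a GenBank or Galaxy feature. Hits are taken to be 1-based inclusive (start, end) pairs on the plus strand of a single subject sequence. The names merge_hits and clip_with_flank and the max_gap parameter are hypothetical.

```python
def merge_hits(hits, max_gap=0):
    """Merge hits that overlap or are separated by at most max_gap bases.

    hits: iterable of (start, end) pairs, 1-based inclusive, same subject.
    """
    merged = []
    for start, end in sorted(hits):
        # Number of bases between the previous hit and this one
        # (negative when the two hits overlap).
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def clip_with_flank(sequence, start, end, flank=100):
    """Return region start..end plus up to `flank` bases on each side.

    Coordinates are 1-based inclusive; flanks are truncated at the
    sequence ends rather than raising an error.
    """
    lo = max(start - flank, 1)
    hi = min(end + flank, len(sequence))
    return sequence[lo - 1:hi]
```

Merging first and clipping second means that two matches split at an in/del come back as one contiguous region, as long as the gap between them does not exceed max_gap.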
This looks somewhat similar to the task you describe below, so I am
wondering if it is something I can do in Galaxy (or with Galaxy plus a
GenBank/BLAST will almost give me what I want. The trouble I have is
that either I can get the result as a multiple sequence alignment but with
useless sequence names (just the gi number for identifier) and not in FASTA
format, or I can get full sequence entries but not the matched region
clipped out. I have asked NCBI/GenBank if they would serve up the
in FASTA format, but they are not responsive on that.
Brian Foley PhD
Please send questions such as these to the galaxy-user mailing list (which
I've cc'd). You can find the 'Extract Genomic DNA' tool under the 'Fetch
Sequences' menu. You may also want to use tool search ('Options ->
I just read through the post at the following
I'm facing the same problem as well.
I would like to extract the transcripts assembled by Cufflinks.
May I know how I can link my output file from TopHat and Cufflinks
I have the following output file right now:
I am a bit confused by the explanation below:
" in order to get the sequence data for transcripts in a Cuff* GTF
want to select for only exons (use Galaxy's 'Extract Features' tool)
use the resultant dataset as input to Extract."
Thanks a lot for your advice.
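As a rough illustration of what "select only exons, then extract" amounts to, here is a Python sketch that builds one spliced sequence per transcript from a GTF file and a genome. It is not the code behind Galaxy's 'Extract Genomic DNA' tool. It assumes standard GTF columns (1-based inclusive coordinates, feature type in column 3, strand in column 7, a transcript_id attribute in column 9) and a genome already loaded into a dict keyed by sequence name. The function names are hypothetical, and a strand of "." is treated as plus.

```python
import re
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequences(gtf_lines, genome):
    """Return {transcript_id: spliced exon sequence}.

    gtf_lines: iterable of GTF lines.
    genome: dict mapping sequence name to sequence string.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep exon records only; transcript-level lines are skipped.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )

    result = {}
    for tid, parts in exons.items():
        parts.sort(key=lambda p: p[1])  # order exons by genomic start
        seq = "".join(genome[chrom][start - 1:end]
                      for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = reverse_complement(seq)
        result[tid] = seq
    return result
```

Each entry of the returned dict can then be written out as ">transcript_id" followed by the sequence to get a FASTA file of transcripts.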