Hi Jen,
Thank you very much for your reply.
The file contains more than 5000 transcripts, so I did not pull out data
per transcript.
I did as you suggested and checked the format. I filtered the GFF file to
get a new file containing only exon information (I was wrong yesterday
because I used the raw GTF file, as I said in my previous mail), then
converted the GTF to BED. That let me use (Extract Features)->Gene
BED To Exon/Intron/Codon BED to get a BED file containing introns, like
this:
1 9162341 9162884 CUFF.1911.1 0 -
1 22819814 22826251 CUFF.5109.1 0 +
1 25887852 25895755 CUFF.5509.1 0 -
1 25895822 25902258 CUFF.5509.1 0 -
1 39783161 39786032 CUFF.8086.1 0 +
Then I ran into another problem: I got an empty file when I used Extract
Genomic DNA to fetch the sequences, whether the input was in GTF format
or not. It returned a correct result when I used a BED file downloaded
from UCSC Main. I checked the format, but I found nothing wrong.
The data downloaded from UCSC Main looks like this:
chr1 133903980 133904133 NM_214429_exon_0_0_chr1_133903981_f 0 +
chr1 133914112 133914267 NM_214429_exon_1_0_chr1_133914113_f 0 +
chr1 133917280 133917449 NM_214429_exon_2_0_chr1_133917281_f 0 +
Then I suddenly found the problem while I was trying to explain it. The
input file for the tool (Extract Genomic DNA) requires chromosome names
in the form 'chr1', for example, rather than '1'.
I worked on this all day. It is really inefficient when there is nobody
to give instructions face to face.
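For anyone who hits the same thing, here is a minimal Python sketch of the fix. It assumes a tab-separated BED file and that simply prefixing 'chr' gives the names the reference genome uses (this may not hold for every contig, e.g. a mitochondrial 'MT' versus 'chrM'). The function name add_chr_prefix is made up for illustration and is not part of Galaxy.

```python
def add_chr_prefix(bed_line: str) -> str:
    """Return the BED line with 'chr' prepended to the chromosome name if missing."""
    fields = bed_line.rstrip("\n").split("\t")
    # Assumption: every sequence name only needs a plain 'chr' prefix.
    if not fields[0].startswith("chr"):
        fields[0] = "chr" + fields[0]
    return "\t".join(fields)


# Two of the intron lines from above, written with tabs.
bed_lines = [
    "1\t9162341\t9162884\tCUFF.1911.1\t0\t-",
    "1\t22819814\t22826251\tCUFF.5109.1\t0\t+",
]
for line in bed_lines:
    print(add_chr_prefix(line))
```

Lines that already start with 'chr' are returned unchanged, so running it twice does no harm.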
Best,
John
To: Jennifer Jackson
Subject: Re: [galaxy-user] Question about Extract intron sequences
from [gtf file] + [genome FASTA file]
Hi Jen,
Thank you very much for your reply.
The file contains more than 5000 transcripts, so I did not pull out data
per transcript.
I did as you suggested and checked the format. I filtered the GFF file to
get a new file containing only exon information (I was wrong yesterday
because I used the raw GTF file, as I said in my previous mail), then
converted the GTF to BED. That let me use (Extract Features)->Gene BED To
Exon/Intron/Codon BED to get a BED file containing introns, like
this:
1 9162341 9162884 CUFF.1911.1 0 -
1 22819814 22826251 CUFF.5109.1 0 +
1 25887852 25895755 CUFF.5509.1 0 -
1 25895822 25902258 CUFF.5509.1 0 -
1 39783161 39786032 CUFF.8086.1 0 +
Then I ran into another problem: I got an empty file when I used Extract
Genomic DNA to fetch the sequences, whether the input was in GTF format
or not. It returned a correct result when I used a BED file downloaded
from UCSC Main. I checked the format, but I found nothing wrong.
The data downloaded from UCSC Main looks like this:
chr1 133903980 133904133 NM_214429_exon_0_0_chr1_133903981_f 0 +
chr1 133914112 133914267 NM_214429_exon_1_0_chr1_133914113_f 0 +
chr1 133917280 133917449 NM_214429_exon_2_0_chr1_133917281_f 0 +
I have worked on this all day. It is really inefficient when there is
nobody to give instructions face to face, so I would appreciate any
tips.
Best,
John
To: 师云
Cc: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] Question about Extract intron sequences
from [gtf file] + [genome FASTA file]
Hello,
There appears to be something odd with the formatting of the GTF file
- the exon numbering is off in the second transcript's first exon. The
exon_number "1" should be "2" (remember to count in reverse, since it is
on the negative strand). But that is a side issue. There are other things
that do not quite make sense, but the entire dataset was not shared.
Run this again, but do the following:
1 - Make sure the files are in interval format and that the column
assignments are correct (click on the pencil icon)
2 - Use strand assignment or, better, separate (+) and (-) stranded
transcripts into two files at the start and run the query in two
workflows from there. Some GOPS tools work best this way.
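Outside Galaxy, the strand split in step 2 can be sketched in a few lines of Python. This is only an illustration, assuming tab-separated input; the function name split_by_strand and its strand_column parameter are hypothetical, not a Galaxy tool.

```python
def split_by_strand(lines, strand_column=7):
    """Split tab-separated interval lines into (+) and (-) lists.

    strand_column is 1-based: 7 for GFF/GTF, 6 for BED6.
    Lines without a '+' or '-' strand value are skipped.
    """
    plus, minus = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < strand_column:
            continue  # too few columns to carry a strand
        strand = fields[strand_column - 1]
        if strand == "+":
            plus.append(line)
        elif strand == "-":
            minus.append(line)
    return plus, minus


gtf_lines = [
    '1\tCufflinks\texon\t3\t22\t1000\t+\t.\tgene_id "CUFF.26"; transcript_id "CUFF.26.1";',
    '1\tCufflinks\texon\t10\t15\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1";',
]
plus_lines, minus_lines = split_by_strand(gtf_lines)
print(len(plus_lines), len(minus_lines))
```

Each list can then be saved as its own dataset and run through the same workflow.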
Also, be aware that some of these transcripts will not have intron
output. For example, the first transcript in your example is a single
exon transcript. Also, if you have genes with overlapping variant
transcripts, these will interfere with the query (you will lose
introns or fractions of introns), but I don't know how large of a
dataset you are working with. If you want to pull out data per
transcript, the tools in the group "Filter and Sort" can be used to
subset GFF/GTF files.
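To make the single-exon point concrete, here is a rough Python sketch that derives introns as the gaps between consecutive exons of the same transcript. It is not how the Galaxy tools are implemented. It assumes tab-separated GTF with a transcript_id attribute and non-overlapping exons within a transcript, and it converts GTF's 1-based inclusive coordinates to BED's 0-based half-open ones. The function name introns_from_gtf is made up for this example.

```python
import re
from collections import defaultdict


def introns_from_gtf(gtf_lines):
    """Return BED6-style intron tuples, one per gap between adjacent exons."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))  # chrom, strand, transcript
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = []
    for (chrom, strand, transcript), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            # GTF exon end (1-based, inclusive) equals the BED intron start
            # (0-based); BED intron end is the next exon's start minus one.
            if next_start - 1 > prev_end:
                introns.append((chrom, prev_end, next_start - 1, transcript, 0, strand))
    return introns


sample = [
    '1\tCufflinks\texon\t3\t22\t1000\t+\t.\tgene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";',
    '1\tCufflinks\texon\t10\t15\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
    '1\tCufflinks\texon\t30\t40\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
]
for intron in introns_from_gtf(sample):
    print("\t".join(str(value) for value in intron))
```

With the sample from the original question, CUFF.26.1 contributes nothing because it has one exon, and CUFF.204.1 yields a single intron between its two exons.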
The last query that you ran is the ideal way to obtain this
information in Galaxy, but the GFF to BED converter creates a BED6
file, not a BED12 file, and this is why the tool produced no output (see
the tool form for required input). Having this tool accept GTF formatted
input might be something to consider as an enhancement - I will run it
by our development team and open a Trello ticket as appropriate.
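For context on the BED6 versus BED12 difference: a BED12 line describes a whole transcript and carries its exon structure in the last three columns (blockCount, blockSizes, blockStarts), whereas a BED6 line is just one interval. The Python sketch below shows roughly what a BED12 line for one transcript looks like when built from GTF exon coordinates. The function name exons_to_bed12 is hypothetical, and the thickStart/thickEnd and itemRgb values are placeholders, since the Cufflinks output here has no coding annotation.

```python
def exons_to_bed12(chrom, name, strand, exons):
    """Build one BED12 line from exons given as (start, end) pairs
    in GTF coordinates (1-based, inclusive)."""
    # Convert to BED coordinates (0-based, half-open) and sort by position.
    spans = sorted((start - 1, end) for start, end in exons)
    chrom_start = spans[0][0]
    chrom_end = spans[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in spans)
    block_starts = ",".join(str(start - chrom_start) for start, _ in spans)
    fields = [
        chrom, chrom_start, chrom_end, name, 0, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd (placeholder: whole span)
        0,                       # itemRgb (placeholder)
        len(spans), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


# The two-exon transcript from the original question.
print(exons_to_bed12("1", "CUFF.204.1", "-", [(10, 15), (30, 40)]))
```

The gaps between blocks in such a line are the introns, which is why a tool that reads block columns has nothing to work with in a BED6 file.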
Another method, which may not be available to you (judging from the
chromosome identifiers, these are not UCSC chrom IDs) but could help
you in the future or help others now, is to use the UCSC Table
Browser. It goes something like this:
1 - Click on "display at UCSC Main" for a GTF dataset, this loads the
data as a custom track, default display in assembly viewer
2 - Once in UCSC, at the top bar, pick Tools -> Table Browser
3 - In the Table Browser, change track group to "Custom Tracks" and
the user track you just loaded will be there
4 - Change region = genome, then output = bed, and make sure "Send
output to Galaxy" is checked, submit
5 - On the next form, you will be given a list of regions to include in
the BED6 output; introns are one of them
Best,
Jen
Galaxy team
Dear Jen,
I am not much of a Galaxy user yet. A few days ago I learned about
Galaxy and found it a really wonderful tool. I am confused by a simple
question: how to extract intron sequences from a [gtf file].
Here is a sample of a gtf file:
1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26";
transcript_id "CUFF.26.1";
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26";
transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1"; exon_number "1";
I want to extract introns from the [gtf] file. I found 2 approaches that
might solve the problem, but neither works:
1. I used (Filter and Sort) -> Filter to split the [gtf] file into 2
files, as follows:
File A (contains transcripts):
1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26";
transcript_id "CUFF.26.1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1";
File B (contains exons):
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26";
transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204";
transcript_id "CUFF.204.1"; exon_number "1";
Then I used (Operate on Genomic Intervals)->Subtract to subtract File
B from File A, returning non-overlapping pieces of intervals. I thought it
would return a file containing introns, but the result was an empty file.
2. I converted the [gtf] file to a [Bed] file and used (Extract
Features)->Gene BED To Exon/Intron/Codon BED, and it returned the same
result, an empty file.
I think there must be something wrong with my approach, so I really
need your help. Thank you very much.
Sincerely yours,
John
--
Jennifer Hillman-Jackson
http://galaxyproject.org