Question About Extract Intron Sequences From [Gtf File] + [Genome Fasta File]


5.3 years ago by

师云 • 110

师云 • 110 wrote:

Dear Jen,

I am still new to Galaxy. I came across it a few days ago and found it a really wonderful tool. I am stuck on what seems like a simple question: how to extract intron sequences from a [gtf file]. Here is a sample of the GTF file:

1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1";
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";

I want to extract the introns from the [gtf] file. I found two approaches that looked like they should work, but neither did:

1. I used (Filter and Sort) -> Filter to split the [gtf] file into two files, as follows:

File A (contains transcripts):
1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1";

File B (contains exons):
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";

Then I used (Operate on Genomic Intervals) -> Subtract to subtract File B from File A, with "Return Non-overlapping pieces of intervals". I expected a file containing the introns, but the result was an empty file.

2. I converted the [gtf] file to a [Bed] file and used (Extract Features) -> Gene BED To Exon/Intron/Codon BED. It returned the same result: an empty file.

I think there must be something wrong with my approach, so I really need your help. Thank you very much.

Sincerely yours,
John
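For readers who want to see what "introns from a GTF" means outside Galaxy, here is a minimal Python sketch. It is not how any Galaxy tool works internally; it simply groups exon records by transcript and reports the gaps between consecutive exons. The function name `introns_from_gtf` is made up for illustration, and the sketch assumes a tab-separated GTF with a quoted `transcript_id` attribute, as in the sample above.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def introns_from_gtf(gtf_lines):
    """Return introns as BED-style tuples: (chrom, start0, end, transcript_id, strand)."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Keep only well-formed exon records; transcript rows are ignored.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = []
    for (chrom, strand, transcript_id), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - 1 > prev_end:
                # GTF is 1-based inclusive, BED is 0-based half-open:
                # the intron prev_end+1 .. next_start-1 becomes [prev_end, next_start-1).
                introns.append((chrom, prev_end, next_start - 1, transcript_id, strand))
    return introns


sample = [
    '1\tCufflinks\texon\t3\t22\t1000\t+\t.\tgene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";',
    '1\tCufflinks\texon\t10\t15\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
    '1\tCufflinks\texon\t30\t40\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
]
print(introns_from_gtf(sample))
# [('1', 15, 29, 'CUFF.204.1', '-')]
```

The single-exon transcript CUFF.26.1 produces no intron, and CUFF.204.1 produces one intron covering positions 16 to 29 in GTF coordinates.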

rna-seq cufflinks • 3.6k views


modified 5.3 years ago • written 5.3 years ago by 师云 • 110

5.3 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

There appears to be something odd with the formatting of the GTF file - the exon numbering is off for the first listed exon of the second transcript. The exon_number "1" should be "2" (remember to count in reverse, since the transcript is on the negative strand). But that is a side issue. There are other things that do not quite make sense, but the entire dataset was not shared.

Run this again, but do the following:

1 - Make sure the files are in interval format and that the column assignments are correct (click on the pencil icon).
2 - Use strand assignment or, better, separate the (+) and (-) stranded transcripts into two files at the start and run the query as two workflows from there. Some GOPS tools work best this way.

Also, be aware that some of these transcripts will not have intron output. For example, the first transcript in your example is a single-exon transcript. In addition, if you have genes with overlapping variant transcripts, these will interfere with the query (you will lose introns or fractions of introns), but I don't know how large a dataset you are working with. If you want to pull out data per transcript, the tools in the group "Filter and Sort" can be used to subset GFF/GTF files.

The last query that you ran is the ideal way to obtain this information in Galaxy, but the GFF to BED converter creates a BED6 file, not a BED12 file, and this is why the tool produced no output (see the tool form for the required input). Having this tool accept GTF formatted input might be something to consider as an enhancement - I will run it by our development team and open a Trello ticket as appropriate.

Another method is to use the UCSC Table Browser. It may not be available to you (from looking at the chromosome identifiers, these are not UCSC chrom IDs), but it could help you in the future or help others now. It goes something like this:

1 - Click on "display at UCSC Main" for a GTF dataset. This loads the data as a custom track, with the default display in the assembly viewer.
2 - Once in UCSC, at the top bar, pick Tools -> Table Browser.
3 - In the Table Browser, change the track group to "Custom Tracks"; the user track you just loaded will be there.
4 - Change region = genome, then output = bed, make sure "Send output to Galaxy" is checked, and submit.
5 - On the next form, you will be given a list of regions to output in the BED6 output. Introns are one of them.

Best,
Jen
Galaxy team

-- Jennifer Hillman-Jackson http://galaxyproject.org
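Two points in this answer can be illustrated with a short Python sketch: exon numbering that runs in reverse on the negative strand, and the difference between a BED6 record and a BED12 record that carries the exon blocks. The helper names `exon_numbers` and `exons_to_bed12` are hypothetical and are not part of Galaxy or any converter; setting thickStart and thickEnd to the transcript bounds and itemRgb to 0 is an assumption made only to fill the required columns.

```python
def exon_numbers(strand, exons):
    """Number exons in transcription order: ascending on '+', descending on '-'."""
    ordered = sorted(exons, reverse=(strand == "-"))
    return {span: number for number, span in enumerate(ordered, start=1)}


def exons_to_bed12(chrom, strand, name, exons):
    """Build one BED12 line from 1-based inclusive exon (start, end) pairs."""
    spans = sorted(exons)
    tx_start = spans[0][0] - 1   # BED starts are 0-based
    tx_end = spans[-1][1]        # BED ends are exclusive, so equal to the 1-based end
    block_sizes = [end - start + 1 for start, end in spans]
    block_starts = [start - 1 - tx_start for start, _ in spans]
    fields = [
        chrom, tx_start, tx_end, name, 0, strand,
        tx_start, tx_end,        # thickStart/thickEnd: assumed equal to transcript bounds
        "0",                     # itemRgb placeholder
        len(spans),
        ",".join(map(str, block_sizes)) + ",",
        ",".join(map(str, block_starts)) + ",",
    ]
    return "\t".join(map(str, fields))


exons = [(10, 15), (30, 40)]
print(exon_numbers("-", exons))
# {(30, 40): 1, (10, 15): 2}
print(exons_to_bed12("1", "-", "CUFF.204.1", exons))
```

With the block count, block sizes, and block starts present, a downstream tool has enough information to work out where the gaps between exons fall. A BED6 line for the same transcript carries only the outer coordinates.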

ADD COMMENT • link written 5.3 years ago by Jennifer Hillman Jackson ♦ 25k

5.3 years ago by

师云 • 110

师云 • 110 wrote:

Hi Jen,

Thank you very much for your reply. The file contains more than 5000 transcripts, so I did not pull out data per transcript. I did as you said and checked the format.

I filtered the GFF file to get a new file containing only the exon records (I was wrong yesterday because I used the raw GTF file, as I described in my earlier mail), then converted GTF to BED. With that, I can use (Extract Features) -> Gene BED To Exon/Intron/Codon BED to get a BED file containing introns, like this:

1 9162341 9162884 CUFF.1911.1 0 -
1 22819814 22826251 CUFF.5109.1 0 +
1 25887852 25895755 CUFF.5509.1 0 -
1 25895822 25902258 CUFF.5509.1 0 -
1 39783161 39786032 CUFF.8086.1 0 +

Then I met another problem: I got an empty file when I used Extract Genomic DNA to fetch the sequences, whether the file was in GTF format or not. The tool returned a correct result when I used a BED file downloaded from UCSC Main. I checked the format and found nothing wrong. The data downloaded from UCSC Main looks like this:

chr1 133903980 133904133 NM_214429_exon_0_0_chr1_133903981_f 0 +
chr1 133914112 133914267 NM_214429_exon_1_0_chr1_133914113_f 0 +
chr1 133917280 133917449 NM_214429_exon_2_0_chr1_133917281_f 0 +

Then I suddenly found the problem while I was trying to explain it. The Extract Genomic DNA tool requires the chromosome names in the input file to match the genome's names, for example 'chr1' rather than '1'.

I worked on this all day. It is really inefficient when there is nobody to instruct me face to face.

Best,
John
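The chromosome-name mismatch described in this reply can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `add_chr_prefix` and `extract_sequence` are invented here, the genome is a plain in-memory dictionary rather than a real FASTA reader, and prefixing "chr" is a naive mapping that will not cover every naming difference between assemblies (mitochondrial and unplaced contigs often differ in other ways).

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def add_chr_prefix(bed_line):
    """Rename '1' to 'chr1' in the first column; leave names that already have the prefix."""
    fields = bed_line.rstrip("\n").split("\t")
    if not fields[0].startswith("chr"):
        fields[0] = "chr" + fields[0]
    return "\t".join(fields)


def extract_sequence(genome, bed_line):
    """Return (name, sequence) for a BED6 line, reverse-complementing '-' strand intervals.

    `genome` is assumed to map chromosome names to sequence strings.
    A missing chromosome name raises KeyError instead of yielding empty output.
    """
    chrom, start, end, name, _score, strand = bed_line.rstrip("\n").split("\t")[:6]
    sequence = genome[chrom][int(start):int(end)]   # BED is 0-based, half-open
    if strand == "-":
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return name, sequence


genome = {"chr1": "AACCGGTTAC"}              # toy sequence, not real data
line = "1\t1\t5\texample_intron\t0\t-"
fixed = add_chr_prefix(line)
print(fixed.split("\t")[0])
# chr1
print(extract_sequence(genome, fixed))
# ('example_intron', 'CGGT')
```

Looking up the unfixed line would fail because the toy genome has no chromosome called "1", which mirrors why the interval file with bare numeric names matched nothing.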

ADD COMMENT • link written 5.3 years ago by 师云 • 110
