Question: Re: Rna Seq Analysis And Gtf Files
Jeremy Goecks wrote (7.7 years ago):
Slim,

Please send questions to the galaxy-user mailing list (cc'd) rather than to individual Galaxy team members; there are many people on the list who may be able to address your question, and discussions are archived for future use as well.

Without seeing your analysis, I'd suggest trying two things:

(1) Provide a gene annotation reference file to Cufflinks as well as to Cuffcompare and Cuffdiff; in other words, you'll want to do guided assembly.
(2) Try using an Ensembl GTF, which has the gene name in the attributes.

I think (2) is more likely to generate the results you want, but there are many known problems in using Ensembl GTFs with Cufflinks/Cuffcompare/Cuffdiff.

Good luck,
J.
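Before choosing an annotation file, it can help to check whether it carries gene_name at all. The following is a minimal sketch, not part of Cufflinks or Galaxy; the function names are made up for illustration, and it assumes attributes in column 9 are written as key "value" pairs.

```python
import re

# Matches one quoted attribute such as: gene_name "Nuak2"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field9):
    """Return the quoted attributes of GTF column 9 as a dict."""
    return dict(ATTR_RE.findall(field9))


def count_gene_name(lines):
    """Return (records_with_gene_name, total_records) for an iterable of GTF lines."""
    with_name = 0
    total = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        total += 1
        if "gene_name" in parse_attributes(fields[8]):
            with_name += 1
    return with_name, total


if __name__ == "__main__":
    sample = [
        'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1"; gene_name "Abc1";',
        'chr1\tsrc\texon\t200\t300\t.\t+\t.\tgene_id "t2"; transcript_id "t2";',
    ]
    print(count_gene_name(sample))
```

If the first number is zero, the file has no gene_name attributes and the gene column in the Cuffdiff output is likely to stay empty.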
Tags: rna-seq, cufflinks
David K Crossman (United States) wrote (7.7 years ago):
Hello!

I would like to ask a question related to this thread. I ran into the same issues and was unaware of having to swap some columns around in the GTF file. So, after "swapping the gene name from the complete table (name2 value, column 12) into the GTF file's gene_id value (which by default is the same as transcript_id)," I uploaded this "patched" file (mm9) into Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this "patched" GTF file as the reference annotation. For both Cufflinks and CuffCompare, the gene_id was present in their respective columns. The problem I have encountered now is that in all of the CuffDiff output files, the gene_id column is blank (it contains a "-"). This example is from the CuffDiff gene expression output file:

test_id      gene  locus                 sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
XLOC_000001  -     chr1:4797973-4836816  q1        q2        OK      73.1908  82.1567  0.115559         -0.71896   0.472168  no
XLOC_000002  -     chr1:4847774-4887990  q1        q2        OK      81.7264  53.1165  -0.43089         2.44474    0.014496  no
XLOC_000003  -     chr1:5073253-5152630  q1        q2        OK      408.289  333.749  -0.20159         2.73173    0.0063    no
XLOC_000004  -     chr1:5578573-5596214  q1        q2        NOTEST  2.34764  4.79772  0.71473          -0.89735   0.369532  no

What am I doing wrong? I am interested in the differentially expressed genes in this RNA-Seq dataset (as well as in calling variants, which is my next step, but I want to get this answered first before moving on). Any info, suggestions or help would be greatly appreciated.

Thanks,
David
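The gene_id "patching" described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name patch_gene_id and the tx_to_gene dictionary (transcript ID to name2 gene name, built beforehand from the complete table) are hypothetical, and the sample line reuses the Nuak2 / NM_028778 pair that appears later in this thread.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "[^"]*"')
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def patch_gene_id(line, tx_to_gene):
    """Replace the gene_id value in column 9 with the gene name mapped
    from the line's transcript_id; other lines are returned unchanged."""
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 9:
        return stripped  # comment or malformed record: leave as is
    match = TRANSCRIPT_ID_RE.search(fields[8])
    if match and match.group(1) in tx_to_gene:
        gene = tx_to_gene[match.group(1)]
        # A function is used as the replacement so that the gene name is
        # inserted literally, without regex escape processing.
        fields[8] = GENE_ID_RE.sub(lambda m: f'gene_id "{gene}"', fields[8], count=1)
    return "\t".join(fields)


if __name__ == "__main__":
    line = ("chr1\tmm9_refGene\tstart_codon\t134212807\t134212809\t0.000000\t+\t.\t"
            'gene_id "NM_028778"; transcript_id "NM_028778";')
    print(patch_gene_id(line, {"NM_028778": "Nuak2"}))
```

Only the text inside column 9 changes, so the line keeps exactly nine tab-separated fields.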
David, can you please share your history with me and I'll take a look (History Options --> Share/Publish --> Share with User --> my email)?

Thanks,
J.

(reply written 7.6 years ago by Jeremy Goecks)
Jeremy Goecks wrote (7.6 years ago):
David,

Your analysis looks reasonable. In fact, in your isoform tracking FPKM file you get nearest_ref_id, so that's promising. What I think is needed is the addition of an attribute called gene_name to your reference file; you can use whatever value you want for the gene name, and using the same value as gene_id probably makes sense. Rerun your analysis with the further-patched GTF file, and let us know if this doesn't solve the problem.

Also note that even with this attribute, some gene name/ID columns and some nearest_ref_id columns will not be populated in some Cuffdiff files. See the post from Howie in this thread for an explanation from a Cufflinks developer:

http://seqanswers.com/forums/showthread.php?t=6288

Best,
J.
Jeremy,

Thank you very much for this information. One quick question. I added the gene_id values to the 10th column of my patched GTF file. After uploading it to Galaxy, the column doesn't have a name (i.e. column 1 = Seqname; column 2 = Source; etc.). Do I need to assign it a name (i.e. gene_name or gene_id) for it to be recognized and, if so, how do you assign column names to GTF files?

Thanks,
David

Quoted from earlier in the email thread (David, replying to the request to share his history): "Jeremy, I've shared it with you using your email address. Thanks, David"

(reply written 7.6 years ago by David K Crossman)
David,

You don't want to add an extra column to your dataset, just an extra attribute to field 9, the GTF attributes column. E.g.:

chr1 mm9_refGene start_codon 134212807 134212809 0.000000 + . gene_id "Nuak2"; transcript_id "NM_028778"; gene_name "Nuak2"

Note that I simply appended ' <space>gene_name<space>"Nuak2" ' to the line; adding a tab anywhere will cause Galaxy to think that you wanted to create an extra column in the file. (I think this is what occurred for you.) If you add the gene_name to each line correctly, Galaxy will recognize the file as GTF format with 9 columns, even with the extra attribute.

Good luck,
J.

(reply written 7.6 years ago by Jeremy Goecks)
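For a whole file, appending the attribute by hand is error-prone, so here is a minimal sketch of one way to script it. The function name add_gene_name is hypothetical, the gene_name value is simply copied from gene_id as suggested earlier in the thread, and the trailing semicolon is an assumption made for consistency with the other attributes (the example line above omits it).

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def add_gene_name(line):
    """Append gene_name (copied from gene_id) to column 9 of one GTF line.

    The attribute is joined with a space, never a tab, so the record
    keeps exactly nine tab-separated fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # An extra tab (for example a 10th column) is reported here.
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    attrs = fields[8].rstrip()
    match = GENE_ID_RE.search(attrs)
    if match is None or 'gene_name "' in attrs:
        return "\t".join(fields)  # nothing to copy, or already present
    if not attrs.endswith(";"):
        attrs += ";"
    fields[8] = f'{attrs} gene_name "{match.group(1)}";'
    return "\t".join(fields)


if __name__ == "__main__":
    line = ("chr1\tmm9_refGene\tstart_codon\t134212807\t134212809\t0.000000\t+\t.\t"
            'gene_id "Nuak2"; transcript_id "NM_028778";')
    print(add_gene_name(line))
```

Running every line of the patched GTF through a function like this, and writing the results to a new file before uploading, avoids the stray-tab problem: a line that already has a tenth column raises an error instead of being silently passed on.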
Jeremy,

I just wanted to give you an update. Adding the "gene_name" attribute with the gene name into column 9 worked extremely well for Cuffdiff. The columns I referenced earlier as being blank now have their respective gene names in them! Thank you very much for all your help!

Thanks,
David

(reply written 7.6 years ago by David K Crossman)