Galaxy Reference Genome

Question: Galaxy Reference Genome


Hello,

Galaxy Main

1.) I am having trouble adding annotations to my TopHat and Cufflinks runs. For the TopHat mapping I used the Mus musculus mm9 reference via the built-in index, but no annotations were available in the output files. I then tried converting the reference genome from UCSC to a SAM file using SAMtools. TopHat would not recognize this file, but Cufflinks did. The Cufflinks output file did not have the annotation either. Any thoughts on the proper way to add annotations?

2.) I am also trying to separate the single mapped reads from the multiple mapped reads that resulted from TopHat. After converting the TopHat output file, I used the filter tool in SAMtools, choosing "0x100 map is not primary". Afterwards I tried to run Cufflinks on the filtered output, only to have it fail.

My ultimate goal is to look at RNA-seq gene expression. I know that I have to:

upload my files -> groom using FASTQ Groomer -> download a reference sequence from UCSC -> convert the reference genome file to a usable format -> run TopHat for mapping using the groomed file and the converted reference annotation -> filter the single mapped reads -> run Cufflinks using the filtered single mapped reads from TopHat.

Now I need to get this basic pipeline to work.

Thanks,
Kristen Roop
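For readers following step 2: the 0x100 bit of the SAM FLAG field marks an alignment as "not primary" (secondary). Before running Cufflinks on a filtered file, it can help to count how many records have that bit set and how many do not. The following is a minimal Python sketch, assuming the TopHat output has already been converted to SAM text; the function name and the example records are hypothetical and not part of any Galaxy tool.

```python
SECONDARY = 0x100  # SAM FLAG bit meaning "not primary alignment"


def count_by_secondary_flag(sam_lines):
    """Return (primary_count, secondary_count) for SAM alignment lines."""
    primary = secondary = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & SECONDARY:
            secondary += 1
        else:
            primary += 1
    return primary, secondary


# Hypothetical records: read2 has one primary and one secondary alignment.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchr1\t100\t50\t36M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t200\t3\t36M\t*\t0\t0\t*\t*",
    "read2\t256\tchr2\t300\t3\t36M\t*\t0\t0\t*\t*",
]
print(count_by_secondary_flag(example))
```

If the filter was set to keep records where the bit is set, only the secondary alignments remain, which would leave a much smaller file than expected.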

rna-seq cufflinks • 3.0k views


modified 6.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 6.5 years ago by Kristen Roop • 10


Jennifer Hillman Jackson ♦ 25k wrote:

Hello Kristen,

Our RNA-seq tutorial and FAQ can help out with the general workflow:
https://main.g2.bx.psu.edu/u/jeremy/p/galaxy-rna-seq-analysis-exercise
https://main.g2.bx.psu.edu/u/jeremy/p/transcriptome-analysis-faq

An iGenomes reference annotation GTF dataset for mm9 is in the Shared Libraries here (import "genes.gtf" to your history; please ignore the other content, as it is under revision):
http://usegalaxy.org -> Shared Data -> Data Libraries -> iGenomes -> mm9

To address your questions, one key misunderstanding may be the difference between a "reference genome" and a "reference annotation" dataset.

* "reference genome" = genomic sequence (sourced in .fasta format) that the data is mapped against with TopHat and that the RNA-seq tools use as a scaffold. Since you are using mm9, selecting the "built-in index" for mm9 is an appropriate choice. A reference genome does not provide annotation beyond genomic positional coordinates.

  Mapping tools, including TopHat, have parameters that specify whether to keep only the best hit or all hits. It sounds as if you need to adjust these parameters in your run. The filter you ran (question #2) may have removed most or all hits. Check the output from the SAM filter: was it greatly reduced or empty? If so, re-run TopHat with parameters that keep the best hit from the start, and move to Cufflinks from there without filtering through SAMtools. Help is on the tool form itself and in the links to the manual.

* "reference annotation" = known transcripts (sourced in .gtf or .gff3 format) that are also mapped against the reference genome. These transcript annotations are most useful when they contain gene, transcript start site, and other key attributes that the Cuff* tools can interpret. The annotation can guide assembly at various levels (loose or strict), depending on how the tool parameters are configured.

  The annotation MUST be mapped to exactly the same reference genome that your FASTQ datasets are mapped to, with exactly the same chromosome naming (see the RNA-seq FAQ for details). Help is also on the Cuff* tool forms, including links to the manuals.

More help, including links to tool help, is on our wiki here (see "Tools on the Main server: Example: unexpected results with RNA-seq analysis tools"):
http://wiki.g2.bx.psu.edu/Support#Interpreting_scientific_results

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
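The chromosome-naming requirement above can be checked before running the Cuff* tools. The following is a minimal Python sketch, not a Galaxy feature: it assumes you have the annotation as GTF text and the reference as FASTA text, and it reports any sequence names used in the GTF that do not appear in the FASTA headers (for example "1" versus "chr1"). All function names and the sample data are hypothetical.

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(gtf_lines):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names


def unmatched_names(gtf_lines, fasta_lines):
    """Names used in the annotation that are absent from the reference."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))


# Hypothetical data: the second GTF record uses "1" instead of "chr1".
fasta = [">chr1", "ACGTACGT", ">chrM", "ACGT"]
gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    '1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
print(unmatched_names(gtf, fasta))
```

An empty list means every annotation name has a counterpart in the reference. It does not prove that the coordinates come from the same genome build.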

written 6.5 years ago by Jennifer Hillman Jackson ♦ 25k

This explanation is very clear. Thank you. I was wondering about some of these issues as well.

It would be wonderful if Galaxy could somehow make it possible to provide a "bed" file for the -G option, or make it feasible to use GTF/BED output from the Table Browser tool as input to the -G option. (Maybe it already does?) GTF is an awkward format and BED would work just as well, if not better.

Best wishes,
Ann Loraine

Ann Loraine, Ph.D.
Associate Professor
Department of Bioinformatics and Genomics
University of North Carolina at Charlotte
North Carolina Research Campus
600 Laureate Way
Kannapolis, NC 28081
704-250-5750
aloraine@uncc.edu
http://www.transvar.org
http://www.bioviz.org
http://www.uncc.edu

Date: Wed, 13 Jun 2012 16:12:27 -0700
To: Kristen Roop <kristen.roop@gmail.com>
Cc: <galaxy-user@bx.psu.edu>
Subject: Re: [galaxy-user] Galaxy Reference Genome

___________________________________________________________
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/

written 6.5 years ago by Loraine, Ann • 60


Jennifer Hillman Jackson ♦ 25k wrote:

Hi Loraine,

I'm glad this was helpful! Linking in the reference annotation can be one of the biggest hurdles when beginning an RNA-seq analysis project.

Extracting GTF data directly from the UCSC Table Browser is currently possible (and is what was used in the RNA-seq tutorial link I shared), but data from this source does not contain all of the attributes in the 9th field that the Cuff* tools can use. This results in missed functionality. GTF data can also be obtained from Ensembl. These files have one extra attribute that the UCSC GTF files do not have, but they still do not have all of the possible attributes, and so also result in missed functionality when used.

The iGenomes dataset, linked from the Cufflinks website (below), contains versions of GTF files from both of these sources that have been modified to include the full complement of attributes. There was an updated release in May, and the Galaxy team plans to include more iGenomes reference annotation GTF files in Shared Libraries in the future.
http://cufflinks.cbcb.umd.edu/igenomes.html

The additional attribute content in iGenomes is itself present at the original sources, but it is contained in related tables or otherwise formatted in ways that the Cuff* tools cannot use. Currently, the iGenomes data covers only a small number of genomes. Really, anyone with the bioinformatics skills to do the work could create a complete Cuff*-compatible GTF file for any genome at UCSC, Ensembl, or another reference genome source that has RefSeq or another stable gene/transcript annotation track, using the ancillary tables and some simple scripting to do the file manipulation. If they wanted, the resulting GTF files could be loaded into a history and shared with the Galaxy community using the existing "Share or Publish" options. Any shared GTF dataset that was well constructed (tested for accuracy) and clearly labeled with sources, etc., would, I'm sure, be a greatly appreciated contribution.
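To see which attributes a given GTF file actually carries in its 9th field, a short script can tally the attribute keys. This is a minimal Python sketch under stated assumptions: attributes are separated by ";" with no semicolons inside quoted values, and the attribute names in the sample records (gene_name, tss_id, p_id) are illustrative only. Check the Cufflinks manual for the set the tools actually read. The function names are hypothetical.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GTF 9th-column string into a {key: value} dictionary."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def attribute_counts(gtf_lines):
    """Count how many records carry each attribute key."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        counts.update(parse_attributes(cols[8]).keys())
    return counts


# Hypothetical records: the first has only the two required attributes.
sample = [
    "\t".join(["chr1", "src", "exon", "100", "200", ".", "+", ".",
               'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "src", "exon", "300", "400", ".", "+", ".",
               'gene_id "g2"; transcript_id "t2"; gene_name "Abc1"; '
               'tss_id "TSS1"; p_id "P1";']),
]
for key, n in sorted(attribute_counts(sample).items()):
    print(key, n)
```

A key that appears on only some records, or not at all, points to the kind of missing attribute content described above.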
About GTF versus BED - this is a good question. GTF format is what the tool authors selected as the input, and it was a good choice over BED format. I can explain why.

It is very true that BED files are easier to manipulate - all data are in distinct columns, and most users are already familiar with tabular data and with BED format in particular. The 9th field of a GTF file is difficult to work with, but it is also the part of the file that is used for most of the conclusion-layer functions the Cuff* tools perform. GTF and BED files are similar in some ways and differ in others (coordinate system), but most importantly BED does not contain the attributes field - the key data that the Cuff* tools use to group and annotate data beyond genomic coordinates. There just isn't a good place to put this data in the BED format specification. Besides, ... the tool authors get to decide this sort of thing :)

Thanks for a good discussion and the opportunity to share some more info about these tools & inputs!

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
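The two differences named above, the coordinate system and the missing attributes field, can be seen by converting a single record. This is a minimal Python sketch, not a Galaxy tool: GTF coordinates are 1-based and inclusive, BED coordinates are 0-based and half-open, so only the start shifts by one. The function name and the sample record are hypothetical, and the BED score is a fixed placeholder.

```python
def gtf_to_bed6(gtf_line):
    """Convert one GTF record to a BED6 line.

    Returns (bed_line, leftover_attributes) to show that the 9th GTF
    column has no matching column in BED.
    """
    cols = gtf_line.rstrip("\n").split("\t")
    chrom, _source, feature, start, end, _score, strand = cols[:7]
    attributes = cols[8]
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based
    bed_end = int(end)          # inclusive end equals half-open end
    bed_line = "\t".join(
        [chrom, str(bed_start), str(bed_end), feature, "0", strand]
    )
    return bed_line, attributes


# Hypothetical GTF record with Cuff*-style attributes.
record = "\t".join([
    "chr1", "src", "exon", "3204563", "3207049", ".", "-", ".",
    'gene_id "g1"; transcript_id "t1"; tss_id "TSS1";',
])
bed_line, leftover = gtf_to_bed6(record)
print(bed_line)
print(leftover)
```

The second value returned is exactly the grouping information (gene, transcript, TSS) that a BED6 line cannot hold.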

written 6.5 years ago by Jennifer Hillman Jackson ♦ 25k
