Hello, I would like to use Salmon for RNA-seq analysis. It ran successfully with Ensembl reference files, but I would prefer to use NCBI/UCSC files. Any suggestions? Please help me find an appropriate human reference transcriptome and GTF file. Thanks, Seby
Hello,
Annotation GTF datasets can be extracted from the UCSC Table browser directly into Galaxy (Get Data > UCSC main). The problem is that the gene_id and transcript_id attributes will have the same content from this source (both will hold the transcript_id value). This is true for all GTF datasets extracted from the UCSC Table browser, regardless of the track chosen, the genome, or whether the "Send to Galaxy" option is used.
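If you want to confirm this on a GTF you already have, a quick check along these lines may help. This is only a sketch: the function names are made up for illustration, and it assumes the usual GTF layout (nine tab-separated columns, attributes written as key "value"; pairs in the last column).

```python
import re

# Matches GTF attribute pairs such as: gene_id "NM_000014";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Return the GTF attribute column (column 9) as a dict."""
    return dict(ATTR_RE.findall(field))


def ids_are_identical(gtf_lines):
    """Return True if every record carrying both attributes has
    gene_id equal to transcript_id (the UCSC Table browser pattern).
    Returns False if any record differs or no record has both."""
    seen = False
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            seen = True
            if attrs["gene_id"] != attrs["transcript_id"]:
                return False
    return seen
```

Pass it an open file handle, e.g. `with open("my.gtf") as fh: print(ids_are_identical(fh))`, where my.gtf is a placeholder name. A result of True means the file has the duplicated-ID problem described above.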
Salmon needs distinct values for transcript and gene, whether the input is a GTF or a tabular transcript-to-gene annotation mapping dataset. There are ways to extract other datasets from UCSC (the gene value is included in other linked tables) and use them to replace the gene_id value in the GTF, but the processing is not straightforward.
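For the tabular option, once you have a GTF whose gene_id and transcript_id really are distinct, the mapping can be pulled out with a few lines of scripting. The sketch below is an illustration, not part of any Galaxy tool: the function names are hypothetical, and it assumes a two-column layout with the transcript first and the gene second, which you should verify against the tool form before use.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "NM_000014";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcript_to_gene(gtf_lines):
    """Collect a transcript_id -> gene_id mapping from GTF records."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        if tx and gene:
            # Keep the first gene seen for each transcript.
            mapping.setdefault(tx, gene)
    return mapping


def write_mapping(mapping, path):
    """Write the mapping as tab-separated text: transcript, then gene."""
    with open(path, "w") as out:
        for tx, gene in sorted(mapping.items()):
            out.write(f"{tx}\t{gene}\n")
```

The resulting file can be uploaded to Galaxy as a tabular dataset. Note that running this on a raw UCSC Table browser GTF would just map each transcript to itself, so it does not solve the problem described above.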
A better alternative is the iGenomes version of the reference annotation. This is based on the UCSC RefSeq Genes track. Find these linked under Homo sapiens >> UCSC/hg38 or UCSC/hg19 at their website. Pick the genome that you are using in other steps. The data will be a match for the built-in genome indexes available across all tools at Galaxy main https://usegalaxy.org that are named hg38 or hg19.
Source: https://support.illumina.com/sequencing/sequencing_software/igenome.html
How to upload: Download the target iGenomes tar.gz archive to your computer, uncompress it locally, then upload just the genes.gtf dataset to Galaxy. This version of the annotation also includes extra attributes that are utilized by HISAT2, Cufflinks, Cuffmerge, Cuffdiff -- specifically: tss_id, p_id, and gene_name -- making it the best option if those are also part of your analysis workflow.
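If you would rather not unpack the whole archive, something like the following could pull out only the annotation file. Treat it as a sketch: the function name is invented for illustration, and it simply takes the first regular file named genes.gtf that it finds, since the exact directory layout inside the archive is an assumption on my part. Check the returned path to confirm you got the file you expected.

```python
import shutil
import tarfile


def extract_genes_gtf(archive_path, out_path):
    """Copy the first regular file named genes.gtf out of a tar.gz
    archive and return its path inside the archive (None if absent)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # isfile() skips directories and symbolic links.
            if member.isfile() and member.name.split("/")[-1] == "genes.gtf":
                src = tar.extractfile(member)
                with open(out_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return member.name
    return None
```

Example call, with placeholder file names: `extract_genes_gtf("igenomes_archive.tar.gz", "genes.gtf")`. The extracted genes.gtf is then the only file you need to upload to Galaxy.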
Galaxy tutorials: https://galaxyproject.org/learn/
Support FAQs: https://galaxyproject.org/support/
Hope that helps! Jen, Galaxy team