Question: Converting transcript data from Salmon to gene level TPM
dw2p wrote, 7 months ago:

I have got TPM data from running Salmon on RNA-seq data. However, this is TPM for each individual transcript (often multiple different ones per gene). I want to collapse multiple transcripts to single genes before running DESeq2. Is there a way to do this in Galaxy? Preferably, a simple way for someone new to this.

Tags: tpm, salmon, galaxy, deseq2, rna-seq
Jennifer Hillman Jackson (United States) wrote, 7 months ago:

Hello,

Salmon can output both transcript-level and gene-level TPM values. You will need to provide a transcript-to-gene mapping file. This is the last option on the Salmon tool form, and its label starts with "File containing a mapping of transcripts to genes."
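To make concrete what "gene level" means here: a gene's TPM is the sum of the TPMs of its transcripts. Below is a minimal Python sketch of that idea, assuming a Salmon quant.sf-style table (tab-separated, with Name and TPM columns) and a transcript-to-gene mapping already loaded as a dictionary. The function name and the example values are hypothetical, and this illustrates the arithmetic only; it is not what the Galaxy tool runs internally.

```python
import csv
import io
from collections import defaultdict


def gene_level_tpm(quant_lines, tx2gene):
    """Sum transcript TPMs per gene.

    In this sketch, transcripts missing from the mapping are kept under
    their own transcript ID (an assumption made for illustration).
    """
    totals = defaultdict(float)
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        gene = tx2gene.get(row["Name"], row["Name"])
        totals[gene] += float(row["TPM"])
    return dict(totals)


# Made-up example data in the quant.sf column layout.
quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1300.0\t10.0\t120\n"
    "tx2\t900\t700.0\t2.5\t15\n"
    "tx3\t2000\t1800.0\t4.0\t60\n"
)
tx2gene = {"tx1": "GeneA", "tx2": "GeneA", "tx3": "GeneB"}

gene_tpm = gene_level_tpm(quant, tx2gene)
for gene, tpm in sorted(gene_tpm.items()):
    print(gene, tpm)
```

Here tx1 and tx2 both belong to GeneA, so their TPMs are added together, while GeneB keeps the value of its single transcript.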

The transcript-to-gene mapping is a tabular dataset or a GTF dataset (with gene_id and transcript_id populated) and can also be used with DESeq2. It is a required input for DESeq2 when using TPM values as input instead of counts from featureCounts or htseq-count.
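If only a GTF is at hand, the two-column tabular form can be derived from it. The following is a rough Python sketch, assuming standard GTF lines with nine tab-separated fields and quoted gene_id and transcript_id attributes in the last field. The function name and the example records are hypothetical.

```python
import re

# Matches attribute pairs such as: gene_id "GeneA";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def tx2gene_from_gtf(lines):
    """Return {transcript_id: gene_id} collected from GTF lines."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        transcript = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        if transcript and gene:
            mapping[transcript] = gene
    return mapping


# Made-up GTF records for illustration.
gtf = [
    "#!example header\n",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "tx1";\n',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "GeneA"; transcript_id "tx2";\n',
    'chr2\tsrc\texon\t50\t90\t.\t-\t.\tgene_id "GeneB"; transcript_id "tx3";\n',
]

mapping = tx2gene_from_gtf(gtf)
# Transcript first, gene second, separated by a tab.
rows = ["%s\t%s" % (tx, gene) for tx, gene in mapping.items()]
print("\n".join(rows))
```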

Tutorials: https://galaxyproject.org/learn/

Hope that helps! Jen, Galaxy team

dw2p wrote, 7 months ago:

Thank you.

Seems I had a couple of problems there. First off, I was using a gene-to-transcript mapping file with too much information (too many columns). I trimmed this down to two columns, the transcript ID (#mm10.knownGene.name) and the official gene symbol (mm10.kgXref.geneSymbol), and that partly worked. The additional problem was that a handful of official gene symbols got spread across two or more columns (spaces, commas, etc.?), which put some rows out of register. Cutting only columns 1 and 2 to a new file worked, and it seems to be fine now. Maybe I didn't start with the best-formatted gene-to-transcript mapping file.
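For anyone doing the same trimming outside Galaxy, here is a small Python sketch of the "cut columns 1 and 2" step that also reports rows whose field count differs from the header, which is one way the out-of-register rows described above might be spotted. It assumes a tab-separated table whose header line starts with "#"; the function name, the third column name, and the example rows are hypothetical.

```python
def cut_two_columns(lines, delimiter="\t"):
    """Keep the first two fields of each data row.

    Returns (kept, irregular): kept is a list of (transcript, gene) tuples,
    irregular is a list of 1-based line numbers whose field count differs
    from the header or that lack two non-empty leading fields.
    """
    kept, irregular = [], []
    expected = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if line.startswith("#"):
            expected = len(fields)  # remember the header width
            continue
        if len(fields) < 2 or not fields[0] or not fields[1]:
            irregular.append(number)
            continue
        if expected is not None and len(fields) != expected:
            irregular.append(number)  # flagged, but first two fields still kept
        kept.append((fields[0], fields[1]))
    return kept, irregular


# Made-up rows; the second data row has one field too many.
table = [
    "#mm10.knownGene.name\tmm10.kgXref.geneSymbol\textra_column\n",
    "tx1\tGeneA\tsome text\n",
    "tx2\tGeneA\tsome text\tstray field\n",
    "tx3\tGeneB\tsome text\n",
]

kept, irregular = cut_two_columns(table)
print(kept)
print("check these lines by eye:", irregular)
```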


All of this sounds like the correct way to troubleshoot. Input format can really make a difference in how content is interpreted by tools (whether used in Galaxy or elsewhere).

One-to-many transcript-to-gene mapping is present in the UCSC "Known Genes" track when combined with many of the related Xref tables (by design). If you want to try a simpler 1-1 transcript-to-gene mapping instead (and can utilize those identifiers), the UCSC track "RefSeq Genes" is another option. All the data is in the primary table, with the RefSeq gene name in the column "name2". To format, extract the entire table from the Table Browser into Galaxy, then use Cut to isolate just the "transcript -tab- gene" data. Or use the Table Browser's option to output "selected columns from the primary and related tables" and pick only the transcript and gene columns for extraction to Galaxy.
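Whichever source the mapping comes from, it can be worth checking whether any transcript is assigned to more than one gene before using the file. A minimal Python sketch of such a check, assuming the mapping has already been read into (transcript, gene) pairs; the function name and the example identifiers are hypothetical.

```python
from collections import defaultdict


def transcripts_with_multiple_genes(pairs):
    """Return {transcript: sorted gene list} for transcripts mapped to more than one gene."""
    genes_by_transcript = defaultdict(set)
    for transcript, gene in pairs:
        genes_by_transcript[transcript].add(gene)
    return {
        transcript: sorted(genes)
        for transcript, genes in genes_by_transcript.items()
        if len(genes) > 1
    }


# Made-up pairs: tx3 appears with two different gene names.
pairs = [
    ("tx1", "GeneA"),
    ("tx2", "GeneA"),
    ("tx3", "GeneB"),
    ("tx3", "GeneC"),
]

ambiguous = transcripts_with_multiple_genes(pairs)
print(ambiguous)
```

An empty result means every transcript in the file points to exactly one gene.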

Glad this worked out!

Reply written 7 months ago by Jennifer Hillman Jackson