How to prepare the file containing a mapping of transcripts to genes for Salmon

6 weeks ago by

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The two column tabular dataset should contain:

transcript_id <tab> gene_id

Where <tab> is a whitespace tab character. There should be no headers, no extra spaces, no extra tabs, and no trailing empty lines. The transcript_id and gene_id should not be the same term/value.
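As a rough illustration of these rules, here is a minimal Python sketch that checks a transcript-to-gene table before it is uploaded. It is not part of Galaxy or Salmon; the function name `check_tx2gene` is made up for this example, and it cannot tell a header row from a data row, so a header still has to be removed by hand.

```python
def check_tx2gene(lines):
    """Return (line_number, problem) tuples for a transcript-to-gene table.

    Hypothetical helper: it only applies the formatting rules described above.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            problems.append((number, "empty line"))
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated columns, found {len(fields)}")
            )
            continue
        transcript_id, gene_id = fields
        if transcript_id != transcript_id.strip() or gene_id != gene_id.strip():
            problems.append((number, "extra whitespace around an identifier"))
        elif not transcript_id or not gene_id:
            problems.append((number, "missing identifier"))
        elif transcript_id == gene_id:
            problems.append((number, "transcript_id and gene_id are identical"))
    return problems
```

Calling it on the lines of the file (for example `check_tx2gene(open(path))`, where `path` is your own file) returns an empty list when none of the rules above are broken.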

The transcript_id should be an exact match for the transcript fasta identifiers in your reference transcriptome. That fasta should have no description content on the title line (">" line). It should only have the sequence identifier for the transcript. Often the tool NormalizeFasta is enough to clean up a fasta dataset and sometimes more text manipulation is needed to reformat the identifier (it depends on where the transcriptome was sourced).
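For anyone who wants to see what "no description content on the title line" means in practice, the following Python sketch trims each ">" line down to its first whitespace-delimited token. It is only an illustration of the idea, not the NormalizeFasta tool itself, and the function name `strip_fasta_descriptions` is invented here.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines, keeping only the identifier on each ">" line.

    Assumption: the identifier is everything up to the first whitespace.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            yield ">" + identifier
        else:
            yield line
```

Identifiers that pack several values into one token (for example with "|" separators) are left untouched by this sketch and need the extra text manipulation mentioned above.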

The formatting rules for Custom Genomes are the same as for Custom Transcriptomes: FAQs: https://galaxyproject.org/support/

Preparing and using a Custom Reference Genome or Build https://galaxyproject.org/learn/custom-genomes/
Mismatched Chromosome identifiers (and how to avoid them) https://galaxyproject.org/support/chrom-identifiers/

For the outputs, when all is set up correctly, the Quantification data will have the transcript names in the first column and Gene Quantification data will have the gene names in the first column.

Please check your inputs against the above and let us know if you need more help with that part.

For this part of your question, I'm not sure what you mean. "moreover, I have tried to send a specific table with these two columns directly on ensemble but couldn't send any query to a galaxy". Could you explain more about what steps you are doing and what is going wrong?

Thanks, Jen, Galaxy team

ADD COMMENT • link modified 6 weeks ago • written 6 weeks ago by Jennifer Hillman Jackson ♦ 25k

Hello,

thank you very much.

I meant that when I try to get a table that includes transcript_id and gene_id directly from Get Data, UCSC Main table browser (group "Gene and Gene predictions", track "UCSC genes", table "Known genes", output format "selected fields from primary and ...") and finally send the query to Galaxy, I get an error: The remote data source application may be off line, please try again later. Error: <urlopen error [Errno 110] Connection timed out>

ADD REPLY • link modified 6 weeks ago • written 6 weeks ago by Leila Kian • 10

Excuse me, one more question: if I want to download these two columns separately from the UCSC table browser, under Select Fields from mm10.wgEncodeGencodeCompVM16 there are only "name" (Name of gene, usually transcript_id from GTF) and "name2" (Alternate name, e.g. gene_id from GTF). There is no gene_id to select according to GENCODE!

ADD REPLY • link written 6 weeks ago by Leila Kian • 10

Each sequence identifier in the transcriptome file looks like this:

>ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|

Is it correct?

ADD REPLY • link written 6 weeks ago by Leila Kian • 10

Hello,

For this case, the field "name" is the transcript identifier and "name2" is the gene identifier. Just pick to download those two fields and you'll have the correct data for the two column tabular file.

Alternatively, you can get the transcript/gene from the fasta header lines (the first two annotations in the line). You'll need to modify this data anyway to have a valid transcript fasta.

For the fasta transformation:

1. Convert Fasta-to-Tabular
2. Convert the "|" (pipes) to tabs with Convert delimiters to TAB
3. Generate the target transcriptome: Convert Tabular-to-Fasta picking the transcript_id (first column) and the sequence (last column) from the output of step 2
4. Wrap the sequence lines to 80 bases with the tool FASTA Width formatter

Alternative transcript/gene tabular data:

Isolate just the first two columns with the tool Cut from the output of step 2
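The same transformation can be sketched outside Galaxy in a few lines of Python, which may help to see what each tool step produces. This is an illustrative sketch, not the Galaxy tools themselves: the function names `read_fasta` and `split_gencode` are made up, and it assumes every title line is pipe-delimited with the transcript ID first and the gene ID second, as in the example header posted above.

```python
def read_fasta(lines):
    """Yield (header_without_'>', sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_gencode(lines, width=80):
    """Return (fasta_text, mapping_text) built from pipe-delimited headers."""
    fasta_out, mapping_out = [], []
    for header, sequence in read_fasta(lines):
        fields = header.split("|")
        if len(fields) < 2:
            raise ValueError(f"header has no gene field: {header!r}")
        transcript_id, gene_id = fields[0], fields[1]
        # Title line keeps only the transcript identifier.
        fasta_out.append(">" + transcript_id)
        # Wrap the sequence to fixed-width lines (80 bases by default).
        fasta_out.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
        # Two-column, tab-separated, no header.
        mapping_out.append(f"{transcript_id}\t{gene_id}")
    fasta_text = "".join(line + "\n" for line in fasta_out)
    mapping_text = "".join(line + "\n" for line in mapping_out)
    return fasta_text, mapping_text
```

Because both outputs come from the same pass over the same records, the identifiers in the fasta and in the first column of the table match by construction.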

This is the FTP source and a description of the data contained in each file. Review the README -- you probably do not want all of the comprehensive transcripts. The fasta transcripts can be filtered by biotype when the data is in the tabular format (tool: Select). Do this before creating a transcript-gene tabular dataset and before creating the final, matching, fasta transcript dataset.
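As an illustration of the biotype filter, here is a small Python sketch. It assumes the biotype is the eighth pipe-delimited field of the title line, which is where "TEC" sits in the example header posted earlier in this thread; the function name `filter_by_biotype` and the default choice of "protein_coding" are only examples, so check the README for the values that fit your analysis.

```python
def filter_by_biotype(records, keep=("protein_coding",), biotype_index=7):
    """Yield (header, sequence) pairs whose header carries a wanted biotype.

    Assumption: headers are pipe-delimited and the biotype is field 8
    (index 7). Records with too few fields are dropped.
    """
    wanted = set(keep)
    for header, sequence in records:
        fields = header.lstrip(">").split("|")
        if len(fields) > biotype_index and fields[biotype_index] in wanted:
            yield header, sequence
```

The filter has to run on the original pipe-delimited headers, before they are reduced to the transcript ID, which is the same ordering as described above.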

Includes the GTF, transcript fasta, and associated data:

Version M16 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M16/
Version M19 (most current): ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/.

Help:

https://www.gencodegenes.org/mouse/ See "Data Format" and "FTP"

ADD REPLY • link written 6 weeks ago by Jennifer Hillman Jackson ♦ 25k

Thanks for the complete explanation. I tried to follow it step by step, but in the end it fails with an error and I don't know why.

error
An error occurred with this dataset:
Version Info: Could not resolve upgrade information in the alotted time. Check for upgrades manually at https://combine-lab.github.io/salmon
Fatal error: Exit code 1 ()
[2018-10-16 16:50:19.537] [jLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
Elapsed time: 1.02779s

Replaced 0 non-ATCG nucleotides
Clipped poly-A tails from 0 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 3.4246e-05s
Writing sequence data to file . . . done
Elapsed time: 5.264e-05s
[info] Building 32-bit suffix array (length of generalized text is 0)
Building suffix array . . . FAILURE: return code from libdivsufsort() was -1

ADD REPLY • link written 6 weeks ago by Leila Kian • 10

This error indicates that the job is running out of resources.

Did you filter the transcriptome fasta to only include complete full-length transcripts?
Are the sequence identifiers in that custom transcriptome an exact match from the transcript_id names in your reference annotation (whether tabular or GTF)?
Can you reproduce this error at a public Galaxy? Try https://usegalaxy.org. If the job still fails there, a bug report from the error can be sent in for feedback. Be sure to leave all datasets undeleted and include a link to this Biostars post in the comments so we can associate the two.
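The first two checks can also be done with a short script before rerunning the job. The Python sketch below is one possible way to do it; the function name `check_transcriptome` is invented for this example. It also reports the total number of bases, since the log posted above says "length of generalized text is 0", which may mean the indexer received no sequence at all (that reading of the log is an assumption, not something confirmed in this thread).

```python
def check_transcriptome(fasta_lines, mapping_lines):
    """Compare FASTA identifiers with the first column of the mapping table."""
    lengths = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    mapped = {
        line.rstrip("\r\n").split("\t")[0]
        for line in mapping_lines
        if line.strip()
    }
    return {
        "total_bases": sum(lengths.values()),
        "empty_sequences": sorted(n for n, size in lengths.items() if size == 0),
        "missing_from_mapping": sorted(set(lengths) - mapped),
        "missing_from_fasta": sorted(mapped - set(lengths)),
    }
```

A clean pair of inputs gives a non-zero `total_bases` and three empty lists.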

FAQs: https://galaxyproject.org/support/#unexpected-results

My job ended with an error. What can I do?

ADD REPLY • link modified 6 weeks ago • written 6 weeks ago by Jennifer Hillman Jackson ♦ 25k

Yes, I filtered the two-column transcript table so that the identifiers match exactly, and the filtered transcriptome fasta has only the transcript identifier on each title line, with the sequence lines wrapped to 80 bases with the tool FASTA Width formatter. I couldn't send the error report, probably because I am not an admin: An error occurred sending the report by email: Mail is not configured for this galaxy instance

But this is the complete error:

Fatal error: Exit code 1 () [2018-10-16 16:50:19.537] [jLog] [info] building index RapMap Indexer

[Step 1 of 4] : counting k-mers Elapsed time: 1.02779s

ADD REPLY • link written 6 weeks ago by Leila Kian • 10

If you can reproduce the error at https://usegalaxy.org, the bug report can be submitted.

ADD REPLY • link written 5 weeks ago by Jennifer Hillman Jackson ♦ 25k
