Can a genome that's been released be added so I can map against it

13 months ago by

Dennis • 10 wrote:

Hi there,

I will be mapping my RNAseq reads to a genome of interest soon. Mapping with Bowtie2 gives me the option of either mapping to an uploaded genome or asking Galaxy support to add the index.

The genome of my model organism has been sequenced and is available at http://cabbagelooper.org

Would you be able to upload the genome/index so I can use it as a built-in option?

Thank you and best regards, Dennis

rna-seq • 422 views

modified 13 months ago • written 13 months ago by Dennis • 10

13 months ago by

Dennis • 10 wrote:

Hello again,

The recommended guide for adding a custom genome says to "Make sure the chromosome identifiers are a match for other inputs".

In the data I have access to (link above), I can't find any chromosome identifiers. The scaffolds and contigs files contain only names followed by sequences in FASTA format. I also have a GFF genes file that has seqid, source, type, start, end, score, strand, phase, and attributes columns. And lastly, a transcripts file with a name, a sequence, and a bunch of numbers in the header of every sequence, like so:

TNI017526-RC transcript offset:832 AED:0.51 eAED:0.51 QI:832|0.62|0.66|0.66|0.62|0.55|9|0|676

Would you be able to tell me where I would find chromosome identifiers in these files?

written 13 months ago by Dennis • 10

Modify the title lines in your fasta dataset (whether chrom, scaffold, or contig) to be an exact match for the seqid in the associated GFF3 reference annotation dataset.

The GFF3 dataset in the link above has seqids like this:

tig00002017_group18

So your fasta title lines need to be an exact match like this (whether from scaffolds or contigs, if not already a match). I didn't download and examine those data to find out which one the annotation is mapped to or what the format looks like:

>tig00002017_group18
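If you can work with the files outside Galaxy, the comparison can be sketched in a few lines of Python. This is only a rough local check, not a Galaxy tool: the function names (`fasta_ids`, `gff_seqids`, `compare_ids`) are made up for illustration, and the sample lines are invented around the identifiers mentioned in this thread.

```python
def fasta_ids(lines):
    """Collect identifiers: the text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_seqids(lines):
    """Collect the first (seqid) column of every GFF3 feature line."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def compare_ids(fasta_lines, gff_lines):
    """Return (seqids missing from the fasta, fasta ids unused by the GFF3)."""
    in_fasta = fasta_ids(fasta_lines)
    in_gff = gff_seqids(gff_lines)
    return sorted(in_gff - in_fasta), sorted(in_fasta - in_gff)


# Invented sample data; real files would be read with open(path).
fasta = [">tig00002017_group18 some description", "ACGT",
         ">tig000033_pilon", "GGCC"]
gff = ["##gff-version 3",
       "tig00002017_group18\texample\tgene\t1\t4\t.\t+\t.\tID=g1",
       "tig00004086\texample\tgene\t1\t4\t.\t+\t.\tID=g2"]

missing_from_fasta, unused_in_fasta = compare_ids(fasta, gff)
print("In GFF3 but not in fasta:", missing_from_fasta)
print("In fasta but not in GFF3:", unused_in_fasta)
```

Any seqid reported as missing from the fasta means features on that sequence cannot be placed, so that list should be empty before you start mapping.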

To fix: convert the dataset with the tool Fasta-to-Tabular, then use tools in the group Text Manipulation to modify the lines (get the matching identifier content into its own column, separated by tabs from all other data), then convert back with Tabular-to-Fasta, picking just the isolated identifier column plus the sequence column. Run NormalizeFasta at the end to wrap the sequences to 80 bases per line. Often NormalizeFasta alone is enough when it is also used to split the title line on the first whitespace, but you'll need to compare the two datasets to see if that easier option will work for you. Other times the identifier is already isolated on the fasta title line (for example, the fasta has no description content) and so is already a match, but you'll need to check.
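For anyone doing the same cleanup outside Galaxy, here is a minimal Python sketch of the simpler case: cut each title line at the first whitespace and rewrap the sequence to 80 bases per line. It only approximates the effect described above and is not the Galaxy tool's own code; `normalize_fasta` is a hypothetical helper name and the sample record is invented.

```python
def normalize_fasta(lines, width=80):
    """Return fasta lines with titles cut at the first whitespace
    and each sequence rewrapped to `width` bases per line."""
    out = []
    seq_parts = []

    def flush():
        # Emit the sequence collected for the previous record, rewrapped.
        seq = "".join(seq_parts)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
        seq_parts.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            flush()
            fields = line[1:].split()
            # Keep only the identifier; drop any description content.
            out.append(">" + (fields[0] if fields else ""))
        else:
            seq_parts.append(line)
    flush()
    return out


# Invented sample record; a small width makes the wrapping visible.
record = [">tig00002017_group18 len=14", "ACGTACGTAC", "GGTT"]
for line in normalize_fasta(record, width=8):
    print(line)
```

If the identifier you need is not the first word of the title line, this shortcut is not enough and the column-based route through the tabular format is the one to follow.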

All of this is covered in the support FAQs, but it can be hard to map to specific data and actions the first time through. Unfortunately, many data sources provide content in a mismatched way, and it has to be cleaned up first to work properly with tools, whether the Galaxy-wrapped version or the command-line version. Getting the inputs in the right format at the start of an analysis will save many headaches later: you avoid format-related tool errors and don't have to decode error messages that do not describe the core problem. Most tools assume that formatting has already been taken care of. Trapping each case with a smart error message that states "this is exactly what is wrong and this is how to fix it" is a nearly impossible task for the original tool author, and certainly difficult to add within a Galaxy wrapper around that tool.

Specifically, start with these for common reformatting solutions:

https://galaxyproject.org/support/chrom-identifiers/
https://galaxyproject.org/support/troubleshoot-an-error/ << bit of a laundry list, but learning how to use these tools will make you a stronger analyst who can handle any (or at least most!!) data you may come across and want to use.

And maybe look at this FAQ, too, since it shows the types of errors that can come up from mismatched inputs (with links, in example context, back to the above help plus other FAQ resources). Be aware that mapping may go fine and the error only comes up later, once you are invested in that processing. You then have to go back to the start, fix the custom genome or other inputs, and rerun mapping plus all downstream steps (not fun!).

https://galaxyproject.org/support/tool-error/

If you need help to understand the content of datatypes, please see the datatypes FAQ in this section or just google it directly to review the specification:

https://galaxyproject.org/support/#getting-inputs-right-

If you get stuck and are working at https://usegalaxy.org (or can load/reproduce the problem there, or simply have a history there where the reformatting doesn't seem to be working out), keep all of your initial and intermediate datasets and steps (don't delete them) and send in a bug report. If a tool fails, a "green bug" icon will be present in the red error dataset once expanded. If there are no error datasets, generate a history share link and send it directly to galaxy-bugs@lists.galaxyproject.org from your registered Galaxy account email address. Include a link to this post for reference and note what the starting datasets are by number. We'll be able to figure out from there (usually!!) what to change to get the data in sync and help you out.

Just to complicate it (sorry): transcript identifiers in other types of input reference data must also be an exact match for the transcript identifiers in other inputs. But the same general process applies - get the data in sync (a match) before using it.

Galaxy tutorials: https://galaxyproject.org/learn/

Hope this helps! Jen, Galaxy team

modified 13 months ago • written 13 months ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen,

Thank you so much - I'll try to work through this. I already normalized all the FASTA files and clipped the title lines at the first white space, but they don't seem to match the GFF file - I downloaded the normalized contigs and they have names like >tig000033_pilon while the GFF file when I open it (however many lines I can see in Galaxy) has names like tig00004086 and tig00002049_group1.

However, the scaffolds file seems to match the GFF file well - I was able to ctrl-F all the title lines between the GFF file and the scaffolds file - so maybe I can align to that? If not, the title lines in the contigs and the GFF file clearly don't match up - I will then fiddle with the tools you suggested and see what stumps me next :)

Thank you so much for your help! This is an absolutely outstanding community!

Also, is there a way to visualize the entire GFF file? I can only see a few lines in Galaxy.

Best, Dennis

modified 13 months ago • written 13 months ago by Dennis • 10
