Question: filter by fasta ids
3.8 years ago by
United States
jasminebro2 wrote:

Hi. I am new to using Galaxy and I'm trying to download useful tools. I have a set of masked files from RepeatMasker and would like to use the coordinates I have for repeat elements to extract the elements and flanking regions.

I would like to use filter_by_fasta_ids and give it a list of IDs to extract that information, but I am not sure what input format the IDs should be in. I have searched the web and tried the online version, but still no luck. Thanks in advance.

 

3.8 years ago by
United States
Jennifer Hillman Jackson wrote:

Hello,

It is a bit difficult to understand what you are trying to do, so if I get this wrong, just clarify and we can offer more help.

If you have genome regions defined by coordinates, based on a specific fasta dataset, then you can use the tool "Extract Genomic DNA" to pull out the sequence for just those coordinate regions. The key here is to create a custom reference genome for the fasta dataset. The tool states that it is for "Genomic DNA", but it works with nearly any fasta dataset (very large NGS datasets are probably the exception, since the job may exceed compute resources).
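To make the idea concrete, here is a minimal Python sketch of what coordinate-based extraction does conceptually. It is not the implementation of "Extract Genomic DNA"; the function names `read_fasta` and `extract_regions` are made up for illustration, and the coordinates are assumed to be BED-style (0-based start, end not included).

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of identifier -> sequence."""
    records = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first word of the title line.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def extract_regions(records, regions):
    """Return (label, subsequence) pairs for (seq_id, start, end) regions.

    Coordinates are assumed to be 0-based with an exclusive end (BED-style).
    """
    extracted = []
    for seq_id, start, end in regions:
        if seq_id not in records:
            # Identifiers must match exactly between the fasta and the coordinates.
            raise KeyError(f"no sequence named {seq_id!r} in the fasta input")
        label = f"{seq_id}:{start}-{end}"
        extracted.append((label, records[seq_id][start:end]))
    return extracted


fasta_lines = [">seq1 example record", "ACGTAC", "GTTT", ">seq2", "GGGCCC"]
records = read_fasta(fasta_lines)
print(extract_regions(records, [("seq1", 2, 6)]))
```

Note that only the first word of each title line is used as the identifier here, so a coordinate file that uses the full title line would not match.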

How to turn a fasta dataset into a custom reference genome is defined here:
http://wiki.galaxyproject.org/Support#Custom_reference_genome

Since you also want flanking regions for your coordinates to be included, first use the tool "Get flanks" to extend the coordinates, then use the Extract tool. The Extract tool matches the chromosome (sequence) identifiers in the genome against those in the coordinate file. The identifiers must match exactly. Use BED format for the coordinate input for best results.
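If it helps to prepare the BED file outside Galaxy, the sketch below shows one way to do it. It assumes the coordinate file is the standard RepeatMasker .out table (query sequence in column 5, begin and end in columns 6 and 7, strand in column 9, repeat name in column 10, with 1-based inclusive positions). The function name `repeatmasker_to_bed` and its `flank` and `seq_lengths` parameters are hypothetical, and widening the interval here is only a stand-in for what you would otherwise do with "Get flanks".

```python
def repeatmasker_to_bed(lines, flank=0, seq_lengths=None):
    """Convert RepeatMasker .out-style rows to BED rows widened by `flank` bases."""
    bed_rows = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (data rows start with an integer score).
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        seq_id = fields[4]
        begin, end = int(fields[5]), int(fields[6])  # 1-based, inclusive
        strand = "-" if fields[8] == "C" else "+"    # "C" marks the complement strand
        name = fields[9]
        start = max(0, begin - 1 - flank)            # BED start is 0-based
        stop = end + flank                           # BED end is exclusive
        if seq_lengths is not None:
            # Optional: do not run past the end of the sequence.
            stop = min(stop, seq_lengths[seq_id])
        bed_rows.append((seq_id, start, stop, name, 0, strand))
    return bed_rows


example = [
    "   SW   perc perc perc  query  position in query  matching repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat class/family",
    "",
    " 1320 15.6  6.2  0.0  seqA  101  200 (800) C  AluSx  SINE/Alu  (0)  300  1  1",
]
for row in repeatmasker_to_bed(example, flank=50, seq_lengths={"seqA": 1000}):
    print("\t".join(str(value) for value in row))
```

The first BED column must be spelled exactly like the sequence identifier in the custom reference genome, so it is worth comparing the two before running the Extract tool.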

All of these are tools found on the public Main Galaxy instance at http://usegalaxy.org. Use the "search" at the top of the tool form to locate them. At the bottom of each tool form is a link to the Tool Shed repository the wrappers are based on.

The tool 'filter_by_fasta_ids' is most likely not what you need, but again, I may have misunderstood your question. Please send more details if that is the case.

Best, Jen, Galaxy team


Hi. Sorry about the confusion, let me clarify. I have 4 files that contain numerous individual fasta files. 3 of the files are for species that do not have a complete genome sequenced (the 4th file is Squirrel Monkey). I downloaded all of these individual files from GenBank.

I took the 4 files and ran them through RepeatMasker with the preferences of my choice. I now have the masked files from RepeatMasker and the file giving me the begin and end (coordinates) for where I can find the repeat element of my choice in each individual fasta file.

So I was thinking I could use 'filter_by_fasta_ids' and use the begin and end points as coordinates for each individual file. However, I was not sure what format the coordinates should be in, since I have individual fasta files that may not be from the same region of the genome or may not have chromosome information listed.

I did not think about creating a custom reference genome, but that sounds like a good idea. I will look into the wiki link you posted above.

I hope this clarifies.

Thanks!

-Jasmine

Reply written 3.8 years ago by jasminebro2

Thanks Jasmine for the extra info. I think the custom genome/Extract method will work for your case. Good luck with the project, Jen, Galaxy team

Reply written 3.8 years ago by Jennifer Hillman Jackson