filter by fasta ids

Hi. I am new to using Galaxy and I'm trying to download useful tools. I have a set of masked files from RepeatMasker and would like to use the coordinates I have for repeat elements to extract the elements and flanking regions.

I would like to use filter_by_fasta_ids and give it a list of IDs to extract that information, but I am not sure what the input format of the IDs should be. I have searched the web and tried the online version, with no luck so far. Thanks in advance.

fasta galaxy input • 1.3k views

modified 3.8 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.8 years ago by jasminebro2 • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

It is a bit difficult to understand what you are trying to do, so if I get this wrong, just clarify and we can offer more help.

If you have genome regions defined by coordinates, based on a specific fasta dataset, then you can use the tool "Extract Genomic DNA" to pull out the sequence for just those coordinate regions. The key here is to create a custom reference genome for the fasta dataset. The tool states that it is for "Genomic DNA", but it works with nearly any fasta dataset (very large NGS datasets are probably the exception, since the job may exceed compute resources).
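To make the idea of coordinate-based extraction concrete, here is a minimal Python sketch of the concept. It is not the code behind "Extract Genomic DNA"; the function names `read_fasta` and `extract_regions` and the toy data are made up for illustration, and it assumes BED-style intervals (0-based start, end not included).

```python
def read_fasta(lines):
    """Parse FASTA lines into {identifier: sequence}.

    The identifier is taken as the first word after '>' (an assumption).
    """
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def extract_regions(seqs, bed_lines):
    """Yield (label, subsequence) for each BED interval."""
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # The identifier in the coordinate file must match the FASTA exactly.
        if chrom not in seqs:
            raise KeyError(f"{chrom!r} is not an identifier in the FASTA dataset")
        yield f"{chrom}:{start}-{end}", seqs[chrom][start:end]


# Toy input, invented for this example.
fasta = [">contig1 example record", "ACGTACGTAC", "GGGTTT"]
bed = ["contig1\t2\t6", "contig1\t10\t13"]
regions = list(extract_regions(read_fasta(fasta), bed))
for label, subseq in regions:
    print(label, subseq)
```

Each interval selects a slice of one named sequence, which is why the fasta dataset has to be available as a reference that the coordinates can be looked up against.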

How to turn a fasta dataset into a custom reference genome is defined here:
http://wiki.galaxyproject.org/Support#Custom_reference_genome

Since you also want flanking regions for your coordinates to be included, first use the tool "Get flanks" to extend the coordinates, then use the Extract tool. The Extract tool matches the chromosome (sequence) identifiers in the genome against those in the coordinate file, and the match must be exact. Use BED format for the coordinate input for best results.
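As an illustration of extending coordinates by a flank, the following Python sketch widens each BED interval on both sides and clamps the result to the sequence boundaries. It is a conceptual example only: the options and output of the Galaxy "Get flanks" tool may differ, and the names `extend_interval`, `extend_bed`, and the `lengths` dictionary are hypothetical.

```python
def extend_interval(start, end, flank, seq_length):
    """Widen a 0-based, end-exclusive interval by `flank` bases per side."""
    new_start = max(0, start - flank)        # do not run past the start
    new_end = min(seq_length, end + flank)   # do not run past the end
    return new_start, new_end


def extend_bed(bed_lines, flank, lengths):
    """Return BED lines with extended coordinates; extra columns are kept."""
    out = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom = fields[0]
        # Identifiers must match the sequence names exactly.
        if chrom not in lengths:
            raise KeyError(f"no sequence length known for {chrom!r}")
        start, end = extend_interval(
            int(fields[1]), int(fields[2]), flank, lengths[chrom]
        )
        out.append("\t".join([chrom, str(start), str(end)] + fields[3:]))
    return out


# Toy input, invented for this example.
lengths = {"contig1": 100}
extended = extend_bed(
    ["contig1\t5\t20\tAluY", "contig1\t90\t98\tL1"], 10, lengths
)
for row in extended:
    print(row)
```

Clamping matters for short contigs, where a repeat element can sit closer to the end of the sequence than the requested flank size.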

All of these are tools found on the public Main Galaxy instance at http://usegalaxy.org. Use the "search" at the top of the tool form to locate them. At the bottom of each tool form is a link to the Tool Shed repository the wrappers are based on.

The tool 'filter_by_fasta_ids' is most likely not what you need, but again, I may have misunderstood your question. Please send more details if that is the case.
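For contrast, filtering by ID keeps or drops whole fasta records and never cuts out a sub-region. The Python sketch below shows that idea only. It assumes one ID per line, matched against the first word of each header; the input format the actual Galaxy tool accepts may differ, and `filter_fasta_by_ids` is a hypothetical name.

```python
def filter_fasta_by_ids(fasta_lines, wanted_ids):
    """Keep only the records whose identifier appears in wanted_ids."""
    wanted = {item.strip() for item in wanted_ids if item.strip()}
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            # Identifier = first word of the header line (an assumption).
            keep = bool(header) and header[0] in wanted
        if keep:
            kept.append(line.rstrip("\n"))
    return kept


# Toy input, invented for this example.
fasta = [">seq1 desc", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]
ids = ["seq1\n", "seq3\n"]
filtered = filter_fasta_by_ids(fasta, ids)
print("\n".join(filtered))
```

Because the whole record is returned, begin and end coordinates have nowhere to go in this kind of input.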

Best, Jen, Galaxy team

modified 3.8 years ago • written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k

Hi. Sorry about the confusion; let me clarify. I have 4 files that each contain numerous individual fasta sequences. 3 of the files are for species that do not have a complete genome sequenced (the 4th file is Squirrel Monkey). I downloaded all of these individual sequences from GenBank.

I took the 4 files and ran them through RepeatMasker with the preferences of my choice. I now have the masked files from RepeatMasker and the file giving me the begin and end (coordinates) for where I can find the repeat element of my choice in each individual fasta file.
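A table of begin and end positions like this can be converted to BED before it is used with coordinate-based tools. The Python sketch below assumes the standard whitespace-separated RepeatMasker `.out` layout (score, three percentage columns, query sequence, begin, end, (left), strand, repeat name, class/family, and so on) with 1-based inclusive positions; check your own file before relying on it. The function name and the example rows are invented for illustration.

```python
def repeatmasker_out_to_bed(lines, repeat_name=None):
    """Convert RepeatMasker .out rows to BED6 lines.

    If repeat_name is given, only rows for that repeat are kept.
    """
    bed = []
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        query = fields[4]
        begin, end = int(fields[5]), int(fields[6])
        strand = "-" if fields[8] == "C" else "+"
        name = fields[9]
        if repeat_name is not None and name != repeat_name:
            continue
        # BED starts are 0-based and ends are exclusive, so only the
        # start shifts by one. The score column is a placeholder.
        bed.append("\t".join([query, str(begin - 1), str(end), name, "0", strand]))
    return bed


# Made-up rows in the assumed layout, for illustration only.
example = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    "  463   1.3  0.6  1.7  seq1  101  150  (850)  +  AluY   SINE/Alu  1    50    (261)  1",
    "  300   10.0 0.0  0.0  seq2  20   80   (20)   C  L1PA2  LINE/L1   (0)  6000  5940   2",
]
all_rows = repeatmasker_out_to_bed(example)
alu_only = repeatmasker_out_to_bed(example, repeat_name="AluY")
for row in all_rows:
    print(row)
```

The first BED column is the query sequence name from the RepeatMasker run, so it works as the identifier even when the records carry no chromosome information.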

So I was thinking I could use 'filter_by_fasta_id' and use the begin and end points as coordinates for each individual file. However, I was not sure what format the coordinates should be in since I have individual fasta files that may not be from the same region of the genome or may not have chromosome information listed.

I did not think about creating a custom reference genome, but that sounds like a good idea. I will look into the wiki link you posted above.

I hope this clarifies.

Thanks!

-Jasmine

written 3.8 years ago by jasminebro2 • 0

Thanks, Jasmine, for the extra info. I think the custom genome/Extract method will work for your case. Good luck with the project! Jen, Galaxy team

written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k
