Extract Genomic DNA - no recognized datasets

3.6 years ago by

Anna • 0

United States

Anna • 0 wrote:

Hi,

I am new to Galaxy. I wish to use the Extract Genomic DNA tool, but under "Fetch sequences for intervals in" it says "No interval or gff dataset available." In my history, I have a tabular interval dataset generated by Pileup-to-Interval, whose input was created via Generate Pileup from a BAM file. The build is specified under the attributes for this dataset.

Any idea why the tool is not recognizing this interval dataset?

My end goal is to get fasta sequences for small (gene-size) genomic intervals for a strain (bam file) mapped to the reference genome. Specifically, I was following Jennifer Hillman-Jackson's suggestion on this query: https://www.biostars.org/p/1388/.

Thank you for any assistance you can offer!

consensus sequence galaxy • 1.1k views

modified 3.6 years ago • written 3.6 years ago by Anna • 0

3.6 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Check the datatype assigned to the dataset. Also, this tool works best when the file strictly follows the specification for its datatype. BED works well and is easy to produce from interval format (use the "Add column" and/or "Cut" functions as needed, along with other tools in the Text Manipulation group).
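
For readers who want to see what such a "Cut" step amounts to, here is a minimal Python sketch that keeps the chromosome, start, and end columns of a tab-separated interval file and emits three-column BED lines. The function name `interval_to_bed` and the default column positions are assumptions for illustration, not part of Galaxy; check which columns hold the coordinates in your own dataset. It also assumes the coordinates are already 0-based and half-open, as BED expects.

```python
import csv
import io

def interval_to_bed(lines, chrom_col=0, start_col=1, end_col=2):
    """Yield 3-column BED lines from tab-separated interval lines.

    Column positions are 0-based and are assumptions; adjust to your data.
    """
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        chrom = row[chrom_col]
        start = int(row[start_col])
        end = int(row[end_col])
        yield f"{chrom}\t{start}\t{end}"

# Hypothetical example data: a header line plus two intervals with an extra column.
example = io.StringIO(
    "#chrom\tstart\tend\tcoverage\n"
    "chr1\t100\t250\t12\n"
    "chr2\t5\t40\t7\n"
)
bed_lines = list(interval_to_bed(example))
print("\n".join(bed_lines))
```

Inside Galaxy the same result comes from the "Cut" tool followed by changing the datatype to BED; the sketch only shows the column selection involved.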

Best, Jen, Galaxy team

written 3.6 years ago by Jennifer Hillman Jackson ♦ 25k

3.6 years ago by

Anna • 0

United States

Anna • 0 wrote:

Jen,

Thanks so much for your quick response!

Now that I have changed the datatype of the pileup to "pileup" (the default assignment was "tabular"), it is recognized by the Extract Genomic DNA tool. Thank you.

I'm not sure that I follow what you are saying next, though. My goal is to end up with a bunch of extracted sequences and here is what I've decided to try:

1) Slice up a whole genome BAM file using a BED file with a list of short intervals (the coordinates of my genes of interest)

2) Generate pileup from new sliced BAM dataset (which should be much smaller now, right?)

3) Convert to interval format using Pileup-to-Interval

4) Feed interval dataset into Extract Genomic DNA tool to get out the fasta sequences of my genes

It sounds like you are suggesting I use BED in step 4 as well? I am not sure how that fits in. Perhaps you are suggesting I use BED to specify my intervals of interest at the final step (instead of at the beginning)? I'm new to this, so I'm getting stuck at each point. Any advice to point me in a better direction is much appreciated.

Best,

Anna

written 3.6 years ago by Anna • 0

Your other follow-up comment solves this, which is good. The "Extract" tool pulls out reference genomic sequence based on coordinates only, so its output will not reflect any other sequence data (the parts that differ from the reference genome) that may have been used to identify those regions. So, glad another method worked out! Jen

written 3.6 years ago by Jennifer Hillman Jackson ♦ 25k

3.6 years ago by

Anna • 0

United States

Anna • 0 wrote:

I think I have figured out how to do what I want, and I'm posting my solution here in case it helps somebody else. I am not using Extract Genomic DNA to retrieve the sequences after all. I figured it out by reading the suggestion in this query: https://www.biostars.org/p/77642/.

To extract consensus sequences from a BAM file:

1) I have a whole-genome BAM file. Slice this using Slice BAM (NGS: SAM Tools) and a BED file that specifies the intervals I eventually want sequences for (in my case, specific loci).

2) Generate pileup from sliced BAM using Generate pileup (NGS: SAM Tools). Select "yes" for Call Consensus. Once this file is generated, look at it and see that it has 10 columns. Column 4 has consensus base calls.

3) Download the file and use perl/python to convert the column of consensus base calls to FASTA. (It did not seem like any of the Galaxy tools were set up to do this.)
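
Step 3 could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the pileup is tab-separated with the reference name in column 1, the reference base in column 3, and the consensus base in column 4, and indel rows are marked with `*` in column 3 and are skipped. The function name `pileup_consensus_to_fasta` is made up for this example. Positions with no coverage do not appear in a pileup, so they are simply absent from the output, and heterozygous calls stay as IUPAC ambiguity codes.

```python
import io
from collections import OrderedDict

def pileup_consensus_to_fasta(lines, width=60):
    """Build FASTA text from the consensus column of a 10-column pileup.

    Assumptions: column 1 = reference name, column 3 = reference base,
    column 4 = consensus base; indel rows carry '*' in column 3.
    """
    seqs = OrderedDict()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[2] == "*":
            continue  # skip blank/short lines and indel rows
        seqs.setdefault(fields[0], []).append(fields[3])
    out = []
    for name, bases in seqs.items():
        seq = "".join(bases)
        out.append(f">{name}")
        # Wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"

# Hypothetical pileup rows, including one indel row that is skipped.
example = io.StringIO(
    "chr1\t101\tA\tA\t30\t0\t60\t5\t.....\tIIIII\n"
    "chr1\t102\tC\tT\t30\t40\t60\t5\tTTTTT\tIIIII\n"
    "chr1\t102\t*\t*/+A\t30\t40\t60\t5\t*\t+A\n"
    "chr1\t103\tG\tR\t30\t20\t60\t4\t.A.A\tIIII\n"
    "chr2\t7\tT\tT\t30\t0\t60\t3\t...\tIII\n"
)
fasta = pileup_consensus_to_fasta(example)
print(fasta, end="")
```

Note that this groups bases by reference name only, so several loci on the same chromosome end up in one record. Splitting the output per locus would need the coordinates from the BED file used in step 1.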

written 3.6 years ago by Anna • 0

Hello, glad you found out how to do what you needed. To let you know, the last step (#3) can be done in Galaxy. The process is similar to working on the command line: the tools in Text/Fasta Manipulation, Group, Sort, and the other simple summary/reorganization tools cover most of the basic file operations of the unix shell (or perl/python). These are "single operation" tools on purpose. They are designed to work together in whatever combination an analysis needs, the way one would string together commands separated by pipes. A "Cut" (at least two columns: identifier + sequence, or cut just the sequence and then add a generic identifier using "Add column") followed by "Tabular-to-FASTA" should produce what you want from a 10-column pileup dataset. If it works out, save the method as a workflow and use it as a "tool" (hide the intermediate datasets and rename the output to be informative for your project's stages, if wanted). Hope this helps! Jen
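
To make the "Cut" then "Tabular-to-FASTA" chain concrete, here is a small Python sketch of the same two-step idea. The functions `cut` and `tabular_to_fasta` are hypothetical stand-ins written for this example and are not the Galaxy tools' actual code. The sketch assumes column 2 (position) serves as the identifier and column 4 holds the consensus base.

```python
def cut(rows, columns):
    """Keep only the given 1-based columns of each row, like a Cut step."""
    for row in rows:
        yield [row[c - 1] for c in columns]

def tabular_to_fasta(rows):
    """Turn (title, sequence) rows into FASTA records, one record per row."""
    for title, sequence in rows:
        yield f">{title}\n{sequence}"

# Hypothetical 10-column pileup rows, already split on tabs.
pileup_rows = [
    ["chr1", "101", "A", "A", "30", "0", "60", "5", ".....", "IIIII"],
    ["chr1", "102", "C", "T", "30", "40", "60", "5", "TTTTT", "IIIII"],
]

# Chain the two steps the way piped commands would be chained.
records = list(tabular_to_fasta(cut(pileup_rows, columns=(2, 4))))
print("\n".join(records))
```

Because each input row becomes its own record, this yields one FASTA entry per position. Joining the bases into a single sequence per locus would take an additional grouping step before the conversion.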

modified 3.6 years ago • written 3.6 years ago by Jennifer Hillman Jackson ♦ 25k

Many thanks, much appreciated!

Anna

written 3.6 years ago by Anna • 0
