Question: Extract Genomic DNA - no recognized datasets
Anna (United States) wrote, 3.6 years ago:

Hi,

I am new to Galaxy. I want to use the Extract Genomic DNA tool, but under "Fetch sequences for intervals in" it says "No interval or gff dataset available." In my history, I have a tabular interval dataset generated by Pileup-to-Interval from a pileup that was created with Generate Pileup from a BAM file. The build is specified in the attributes for this dataset.

Any idea why the tool is not recognizing this interval dataset?

My end goal is to get fasta sequences for small (gene-size) genomic intervals for a strain (bam file) mapped to the reference genome. Specifically, I was following Jennifer Hillman-Jackson's suggestion on this query: https://www.biostars.org/p/1388/.

Thank you for any assistance you can offer!

consensus sequence galaxy • 1.1k views
Jennifer Hillman Jackson (United States) wrote, 3.6 years ago:

Hello,

Check the datatype. Also, this tool works best with a file that strictly follows the specification for its datatype. BED works well, and interval format is easy to convert into it (use the "Add a column" and/or "Cut" function as needed, along with other tools in the Text Manipulation group).
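For readers who want to see the conversion spelled out, here is a minimal Python sketch of what the "Cut" step amounts to: moving the chromosome, start, and end columns to the front so the rows follow BED's column order. The function name `interval_to_bed3` and the example rows are hypothetical, this is not Galaxy's implementation, and the sketch assumes the interval coordinates are already 0-based and half-open as BED expects.

```python
def interval_to_bed3(rows, chrom_col, start_col, end_col):
    """Reorder tab-separated interval rows into BED3 (chrom, start, end).

    Column arguments are 1-based, mirroring how a "Cut"-style tool
    names columns (c1, c2, ...). Hypothetical helper, not a Galaxy API.
    """
    bed_rows = []
    for row in rows:
        row = row.rstrip("\n")
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = row.split("\t")
        chrom = fields[chrom_col - 1]
        start = int(fields[start_col - 1])
        end = int(fields[end_col - 1])
        if start >= end:
            raise ValueError(f"start must be less than end: {row!r}")
        bed_rows.append(f"{chrom}\t{start}\t{end}")
    return bed_rows


# Made-up interval rows with the chromosome in column 2.
example = ["geneA\tchr1\t100\t250", "geneB\tchr2\t40\t90"]
print("\n".join(interval_to_bed3(example, chrom_col=2, start_col=3, end_col=4)))
```

Inside Galaxy the same reordering would be done with "Cut" on the three relevant columns, followed by setting the datatype to BED.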

Best, Jen, Galaxy team

Anna (United States) wrote, 3.6 years ago:

Jen,

Thanks so much for your quick response! 

Now that I have changed the datatype of the pileup to "pileup" (the default assignment was "tabular"), it is recognized by the Extract Genomic DNA tool. Thank you.

I'm not sure that I follow what you are saying next, though. My goal is to end up with a bunch of extracted sequences and here is what I've decided to try:

1) Slice up a whole genome BAM file using a BED file with a list of short intervals (the coordinates of my genes of interest)

2) Generate pileup from new sliced BAM dataset (which should be much smaller now, right?)

3) Convert to interval format using Pileup-to-Interval

4) Feed interval dataset into Extract Genomic DNA tool to get out the fasta sequences of my genes

It sounds like you are suggesting I use BED in step 4 as well? I am not sure how that fits in. Perhaps you are suggesting I use BED to specify my intervals of interest at the final step (instead of at the beginning)? I'm new to this, so I'm getting stuck at each point. Any advice to point me in a better direction is much appreciated.

Best,

Anna


Your other follow-up comment solves this, which is good. The "Extract" tool pulls out genomic sequence from the reference based on coordinates. The result will not reflect any other sequence data (the parts that differ from the reference genome) that may have been used to identify those regions. So, glad another method worked out! Jen
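To make the point concrete, here is a small Python sketch of coordinate-based extraction. It is an illustration only, with a hypothetical function name and a toy reference, not the code behind the Galaxy tool. Note that the function only ever sees the reference and the coordinates, so variants present in the mapped reads cannot appear in its output.

```python
def extract_reference(reference, bed_rows):
    """Return (header, sequence) pairs cut from a reference by BED coordinates.

    `reference` maps chromosome name -> sequence string (toy stand-in
    for a genome build). Hypothetical helper for illustration.
    """
    records = []
    for row in bed_rows:
        chrom, start, end = row.rstrip("\n").split("\t")[:3]
        start, end = int(start), int(end)
        # BED is 0-based and half-open, which matches Python slicing.
        records.append((f"{chrom}:{start}-{end}", reference[chrom][start:end]))
    return records


reference = {"chr1": "ACGTACGTAC"}  # toy reference, not real data
for header, seq in extract_reference(reference, ["chr1\t2\t6"]):
    print(f">{header}\n{seq}")
```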

Reply written 3.6 years ago by Jennifer Hillman Jackson
Anna (United States) wrote, 3.6 years ago:

I think I have figured out what I am trying to do, so I'm posting my solution here in case it helps somebody else. I am not using Extract Genomic DNA to retrieve the sequences I want after all. I figured it out by reading the suggestion in this query: https://www.biostars.org/p/77642/.

To extract consensus sequences from BAM file:

1) I have a whole-genome BAM file. Slice this using Slice BAM (NGS: SAM Tools) and a BED file that specifies the intervals I eventually want sequences for (in my case, specific loci).

2) Generate pileup from sliced BAM using Generate pileup (NGS: SAM Tools). Select "yes" for Call Consensus. Once this file is generated, look at it and see that it has 10 columns. Column 4 has consensus base calls.

3) Download the file, use perl/python to convert the column of consensus base calls to FASTA. (It did not seem like any of the Galaxy tools were set up to do this.)
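Step 3 could look roughly like the following Python sketch. It assumes the 10-column consensus pileup layout described in step 2 (chromosome in column 1, 1-based position in column 2, reference base in column 3, consensus call in column 4) and assumes that indel lines carry `*` in column 3, as in the older samtools consensus pileup output. The function name and the `chrom:start-end` header style are my own choices, and the example rows are made up.

```python
def pileup_to_fasta(pileup_lines, width=60):
    """Build FASTA text from 10-column consensus pileup lines.

    Consecutive positions on the same chromosome are joined into one
    record; a gap in positions or a new chromosome starts a new record.
    """
    records = []  # each item: [chrom, start_position, list_of_bases]
    for line in pileup_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        chrom, pos, ref, consensus = fields[0], int(fields[1]), fields[2], fields[3]
        if ref == "*":
            continue  # assumed indel line: skipped in this simple sketch
        last = records[-1] if records else None
        if last and last[0] == chrom and last[1] + len(last[2]) == pos:
            last[2].append(consensus)  # extends the current contiguous run
        else:
            records.append([chrom, pos, [consensus]])

    out = []
    for chrom, start, bases in records:
        seq = "".join(bases)
        out.append(f">{chrom}:{start}-{start + len(seq) - 1}")
        # wrap the sequence to a fixed line width
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""


# Made-up rows for illustration only.
rows = [
    "chr1\t100\tA\tA\t30\t0\t60\t5\t.....\tIIIII",
    "chr1\t101\tC\tT\t30\t30\t60\t5\tTTTTT\tIIIII",
    "chr1\t105\tG\tG\t30\t0\t60\t5\t.....\tIIIII",
]
print(pileup_to_fasta(rows), end="")
```

Positions with no coverage are absent from the pileup, so this sketch splits a locus into separate records at such gaps rather than padding with N. Consensus calls may also be IUPAC ambiguity codes, which are passed through unchanged.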

 


Hello, glad you found out how to do what you needed. For your information, the last step (#3) can also be done in Galaxy. The process is similar to working on the command line: the tools in Text/Fasta Manipulation, Group, Sort, and the other simple summary/reorganization tools cover most of the basic file operations of the Unix shell (or Perl/Python). These are "single operation" tools on purpose, designed to be combined as needed to customize an analysis, the way one would string together commands separated by pipes. A "Cut" (at least two columns: identifier + sequence, or cut just the sequence and then add a generic identifier using "Add column") followed by "Tabular-to-Fasta" should produce what you want from a 10-column pileup dataset. If it works out, save the method as a workflow and use it as a "tool" (hide the intermediate datasets and rename the output to be informative for your project's stages, if wanted). Hope this helps! Jen
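As a rough picture of how such single-operation steps chain together, here is a Python sketch with two small generator functions standing in for "Cut" and "Tabular-to-Fasta". These are hypothetical stand-ins written for illustration, not the Galaxy tools' actual code, and the input rows are made up.

```python
def cut(rows, columns):
    """Keep only the given 1-based columns, in the given order."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        yield "\t".join(fields[c - 1] for c in columns)


def tabular_to_fasta(rows, title_column, sequence_column):
    """Turn each tabular row into one FASTA record (title line + sequence)."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        yield f">{fields[title_column - 1]}\n{fields[sequence_column - 1]}"


# Made-up 10-column pileup rows for illustration only.
pileup_rows = [
    "chr1\t100\tA\tA\t30\t0\t60\t5\t.....\tIIIII",
    "chr1\t101\tC\tT\t30\t30\t60\t5\tTTTTT\tIIIII",
]

# Chain the steps the way shell commands are chained with pipes:
# keep identifier (column 1) and consensus base (column 4), then format.
two_columns = cut(pileup_rows, columns=[1, 4])
for record in tabular_to_fasta(two_columns, title_column=1, sequence_column=2):
    print(record)
```

In this sketch every input row becomes its own record, so a pileup with one base per row would first need a grouping or concatenation step to yield one sequence per locus.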

Reply modified 3.6 years ago, written 3.6 years ago by Jennifer Hillman Jackson

Many thanks, much appreciated!

Anna

Reply written 3.6 years ago by Anna