Question: Stitch MAF blocks - can gaps in reference be kept?
kjsiddle (France) wrote, 3.8 years ago:

Dear all,

I am using the Galaxy tool "Stitch MAF blocks" to extract multiple alignments (from the 28-way Multiz alignment) and concatenate them into a single sequence per species. I have noticed that the resulting hg18 sequence in the FASTA file contains no gaps, even though there are gaps within the individual blocks. For my downstream applications I need to keep these gaps.

Does anyone know if there is a way to keep the gaps with this tool? Alternatively, can anyone suggest another approach that might solve this problem?

Many thanks in advance,

Katherine 

Jennifer Hillman Jackson (United States) wrote, 3.8 years ago:

Hello,

Please review the tool option "Split into Gapless MAF blocks:" near the bottom of the form. Currently, this is the method to preserve any gaps that may occur with this tool in the primary genome.

Best, Jen, Galaxy team


Thank you for your response. I have actually already tried this option and it made no difference: there were no gaps in hg18 with either "yes" or "no". As an example, one of the regions I am trying to extract is chr12(+):9111571-9111685.

When I use "stitch" I get the following:

>hg18.chr12(+):9111571-9111685
ACATTGACCAGAAAAAGTGTTTATTCATCAAGTCTTTAAAGATACAAAAACACGTGTCTTCTGTGGAGCTCTGAGAACAGGACTCCAGCAAAGCACTTTTCAGCCTTGTGGTCT

However, if I use "Extract MAF blocks" followed by "MAF to FASTA" I get:

>hg18
ACATT---------GACC-AGAAAAAGTGTTTATTCATCAAGTCTTT----------------------AAAGATAC--AAAAACA--CGTGTCTTCTGTGGAGCTCTGAGAACAGGACT-CCAGCAAAGCACTTTTCAGCCTTGTGGTCT

I could use the latter option; however, it has two principal limitations. 1) Sequences on the negative strand need additional processing, which "Stitch MAF blocks" seems to handle automatically, and 2) I would need to extract and convert each sequence individually, because "MAF to FASTA" concatenates all sequences in the file. As I am searching for around 10,000 sequences genome-wide, this doesn't seem tractable unless there is a way to loop over a list of queries in Galaxy.
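For readers working outside Galaxy, the two limitations above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`reverse_complement_gapped`, `stitch_region`, `stitch_many`) and the region data are made up for the example, and it assumes the per-block aligned texts for each region have already been extracted in order.

```python
from typing import Dict, List, Tuple

# Hypothetical helpers for illustration; not part of Galaxy or any MAF library.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_gapped(aligned: str) -> str:
    """Reverse-complement an aligned sequence; '-' gap characters are kept."""
    return aligned.translate(_COMPLEMENT)[::-1]


def stitch_region(block_texts: List[str], strand: str = "+") -> str:
    """Join the per-block aligned texts of one species, preserving gaps."""
    stitched = "".join(block_texts)
    return reverse_complement_gapped(stitched) if strand == "-" else stitched


def stitch_many(regions: Dict[str, Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each region name to its gapped sequence, one record per region."""
    return {name: stitch_region(texts, strand)
            for name, (strand, texts) in regions.items()}


# Made-up regions: name -> (strand of the query, aligned text per block).
regions = {
    "regionA": ("+", ["ACATT---GACC", "-AGAAA"]),
    "regionB": ("-", ["AC-GT", "TT"]),
}
for name, seq in stitch_many(regions).items():
    print(f">{name}\n{seq}")
```

Because every region is handled by the same function, looping over about 10,000 queries is just a larger dictionary; the gaps survive because they are never stripped before joining.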

The tools on Galaxy are great for handling what is a pretty difficult file type, but I am just stuck on this one simple thing! Any suggestions for how to get around it would be greatly appreciated.

Many thanks,

Katherine

Reply written 3.8 years ago by kjsiddle (modified 3.8 years ago)

Hello,

This is how the tools function. But you can use a combination of other tools to extract the sequences directly from the MAF output and reverse complement any reported on the (-) strand. I shared an example workflow (below) that will do this in batch for the entire output from the "Extract MAF Blocks" tool. I'll leave it shared for a few days until you let me know that you have it.
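As a rough picture of what "extracting the sequences directly from the MAF output" involves, here is a minimal Python sketch. It is not the shared workflow: the parser, the `MafRow` type, and the example MAF text are all made up for illustration, and it assumes the standard "s" line layout (source, start, size, strand, source size, aligned text). The strand column is kept so that (-) rows can be reverse complemented in a later step.

```python
from typing import Iterable, Iterator, List, NamedTuple


class MafRow(NamedTuple):
    """One 's' (sequence) line of a MAF block."""
    src: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str


def parse_maf(lines: Iterable[str]) -> Iterator[List[MafRow]]:
    """Yield each alignment block as a list of its 's' rows."""
    block: List[MafRow] = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new 'a' line closes the current block.
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) == 7:
            src, start, size, strand, src_size, text = fields[1:]
            block.append(MafRow(src, int(start), int(size), strand,
                                int(src_size), text))
    if block:
        yield block


def to_tabular(blocks: Iterable[List[MafRow]], species: str) -> List[str]:
    """Tab-separated rows for one species; the gapped text is left untouched."""
    return ["\t".join(str(value) for value in row)
            for block in blocks for row in block
            if row.src.split(".")[0] == species]


# Made-up MAF text, only to exercise the parser.
EXAMPLE_MAF = """\
##maf version=1
a score=100.0
s hg18.chr12 100 5 + 1000 AC-GTT
s mm8.chr6   200 6 - 2000 ACTGTT

a score=50.0
s hg18.chr12 105 3 + 1000 GGA
"""

blocks = list(parse_maf(EXAMPLE_MAF.splitlines()))
for line in to_tabular(blocks, "hg18"):
    print(line)
```

Each output row corresponds to one block, so the hg18 gaps from the original alignment are preserved rather than stitched away.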

Once you have a copy, you can edit it any way you want. For example, if you wanted to save more information from the MAF line, you can add in a step like "Text Manipulation -> Merge Columns" once the data is converted to a tabular format, before running the other steps. Or get very fancy and "Text Manipulation -> Add column" with an underscore, and use that underscore between the fields when running Merge, to make the sequence identifier more readable. This will not be the same sequence identifier as "Stitch MAF Blocks" produces, but may be enough.
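The "Merge Columns" idea can be pictured with a small Python sketch. The column layout here (source, start, size, strand, source size, aligned text) is an assumption chosen for the example and may not match the tabular dataset your workflow produces; `merge_columns` and `tabular_to_fasta` are hypothetical names, not Galaxy tools.

```python
from typing import Iterable, List, Sequence


def merge_columns(fields: Sequence[str], columns: Iterable[int],
                  sep: str = "_") -> str:
    """Join the chosen 0-based columns with a separator to form an identifier."""
    return sep.join(fields[i] for i in columns)


def tabular_to_fasta(lines: Iterable[str],
                     id_columns: Sequence[int] = (0, 1, 2, 3),
                     seq_column: int = 5) -> List[str]:
    """Turn tab-separated rows into FASTA records with merged identifiers."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        header = ">" + merge_columns(fields, id_columns)
        records.append(header + "\n" + fields[seq_column])
    return records


# Made-up row in the assumed layout.
rows = ["hg18.chr12\t100\t5\t+\t1000\tAC-GTT"]
print("\n".join(tabular_to_fasta(rows)))
```

The underscore separator plays the same role as the extra column added before the merge step: it keeps the fields distinguishable in the final identifier.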

 https://usegalaxy.org/u/jen-bx-galaxy-edu/w/maf-sequences

Thanks! Jen, Galaxy team

Reply written 3.8 years ago by Jennifer Hillman Jackson

Hello,

I am interested in a problem very similar to the original poster's, but I can't seem to access the workflow that you posted.

Would you be able to share it with me please?

Many thanks,

Thomas

Reply written 2.9 years ago by Thomas

Hi Thomas, let me see if I still have this. If not, I will create a permanent version and then share it back here. Jen, Galaxy team

Reply written 2.9 years ago by Jennifer Hillman Jackson

Hello again. I created an example workflow under Shared Data: Published Workflows. Anyone can import it and edit it to suit their own analysis needs.

 https://usegalaxy.org/u/jen/w/bed-to-maf-to-fasta

Hopefully this helps! Jen, Galaxy team

Reply written 2.9 years ago by Jennifer Hillman Jackson

Thank you, much appreciated

Thomas

Reply written 2.9 years ago by Thomas