Troubles with full settings on Lastz

Question: Troubles with full settings on Lastz

3.4 years ago by

eddreuzy • 0

France

eddreuzy • 0 wrote:

Hi,

I am new to bioinformatics and I have troubles using Lastz.

I use it to detect a 39nt sequences at the beginning of the reads.

I want to keep all reads with no more than 3mismatch/39nt and 1gap/insertion.

I had good results using yasra setting "454 75%homology" and filtering for 90%homology and 100%coverage.

Nevertheless the reads with gaps at either the begining or the end of the 39nt sequence fail the mapping.

How can I use the full settings to reproduce this yasra setting while including reads with 1 gap within the first or last nucleotides of the query sequence ?

Thank you for your help, I tried a lot of settings already but couldn't manage to get what I need...

Edouard

lastz alignment galaxy • 1.3k views

ADD COMMENT • link •

modified 3.4 years ago by Bob Harris • 190 • written 3.4 years ago by eddreuzy • 0

Hello, A reply from Bob Harris (the developer of Lastz) is pending. It may be a day or so before he can respond, but it will be the best, most complete answer. I will also bookmark and track. Thank you for your patience, Jen, Galaxy team

ADD REPLY • link written 3.4 years ago by Jennifer Hillman Jackson ♦ 25k

3.4 years ago by

Bob Harris • 190

United States

Bob Harris • 190 wrote:

Howdy, Edouard,

I've tried to reproduce this failure using command-line lastz and was unable to.

Specifically, I created a random 39nt sequence as my target sequence, then created two 100nt sequences with the 39nt sequence embedded at positions 1..39 and at 2..40. I ran lastz with the 39nt sequence as the reference, and the two 100nt sequences as reads, with the command-line option corresponding to "Roche-454 75% Identity", and filtering for 90% identity and 100% coverage. Alignments for both reads were reported.

I might not be correctly interpreting what you mean by a gap "at either the begining or the end of the 39nt sequence". Could you provide examples of reads that fail for both those conditions?

Bob H

(lastz author)

ADD COMMENT • link written 3.4 years ago by Bob Harris • 190

3.4 years ago by

eddreuzy • 0

France

eddreuzy • 0 wrote:

Hi,

Thank you for that help!

Sorry, I realized that what I wrote was not very clear,

Here is the context : I want to map viral insertion sites in the genome. to that aim, I amplify vector -genome junctions by PCR. Therefore I end up with reads containing a 39nt sequence corresponding to the 3' end of the viral genome followed by the human gDNA flanking the viral genome.

I need to select valid reads (containing the 39nt with ~90% homology /1ins-del, Figure (1)) and trim that sequence to keep only the human DNA sequence to map (figure (2)). To precisely determine the viral insertion site (figure (3)) I need to remove precisely the complete viral sequence : if there is an insertion or a deletion I need to know that the trimming must be done on only the first 38nt or 40nt instead of 39nt; so that the +1 of the sequence still matches the insertion site.

I have also created sequences to see what would be kept :

the reference sequence is the following :

GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTA. I added this sequence in 3' GCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

The reads that were losts are :

GTATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GATTCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCCTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

G-TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GA-CCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTC-AGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCT-TAGCAGGTGGTGGTGGTGGTGGTGGTGGTGGT

Do you know if there is a way to get these sequences back ?

By the way in the results file I get the end position of the mapped sequence on each read.

Do you know any tool that could use this value to trim each sequence to its specific ending position ?

Thank you a lot for your help !

Edouard

ADD COMMENT • link modified 3.4 years ago • written 3.4 years ago by eddreuzy • 0

3.4 years ago by

eddreuzy • 0

France

eddreuzy • 0 wrote:

Ok, just to say that I found out how to trim each read to the right position.

(I will split the file in multiple files, each one for a particular lentgh, perform the trimming and pool together the trimmed sequences)

Edouiard

ADD COMMENT • link written 3.4 years ago by eddreuzy • 0

3.4 years ago by

Bob Harris • 190

United States

Bob Harris • 190 wrote:

I've tried the example sequences with command line lastz and it looks like the problem is that galaxy must be using an older version of lastz. A change made in lastz version 1.03.00 improves the tolerance for mismatches and indels near the end of reads.

Specifically, version 1.03.00, with your parameter settings, reports 7 of those 8 reads. The 5th one is still missed, but that's because the resulting alignment doesn't cover all 39 nt of the reference, so it is discarded because it fails the 100% coverage criteria. The reason it doesn't align to the whole 39 is that it has a choice between the following two alignments. The first alignment has a lower score, since opening a gap for the A has a larger penalty than treating the A as a substitution. I'm not sure how you might deal with this; perhaps you could relax coverage to, say, 97%, and then add a post processing step to make sure the alignment begins at query base one and reference close to one.

ref    GATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTA
query5 G-TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTA

ref     ATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTA
query5  GTCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTA

Regarding the change that was made in version 1.03.00: the original design of lastz (which was intended for local alignments in chromosomes) was found to have a shortcoming when the true alignment is interrupted by the end of a sequence. When a mismatch or gap occurs near the end, a higher score can be achieved by just discarding the end. This is explained in greater detail here: http://www.bx.psu.edu/~rsharris/lastz/newer/README.lastz-1.03.66.html#adv_shadow

This was fixed by the addition of the --noytrim option. In general terms, that option directs lastz to keep an alignment that reaches the end of the sequence, even if that lowers the alignment score (the specifics are more complicated than that). This option was added in version 1.1.50. However, it was not incorporated into the option group settings that galaxy uses (e.g. "Roche-454 75% Identity"). I think galaxy is using something newer than 1.1.50, but the galaxy interface doesn't allow the user to specify lastz options that the galaxy interface doesn't know about. So even though a command line user could just add --noytrim to the command line, galaxy users couldn't do that.

Eventually I realized that --noytrim needed to be added to those option group settings, and did so in 1.03.00, along with a couple other options that improved things for aligning sequenced reads. There is some description of that here: http://www.bx.psu.edu/~rsharris/lastz/newer/README.lastz-1.03.66.html#options_yasra . But from the fact that you are having this problem I infer galaxy must be using something earlier than 1.03.00.

Unfortunately that doesn't really solve your problem, at least not within galaxy.

Bob H

ADD COMMENT • link written 3.4 years ago by Bob Harris • 190

3.4 years ago by

eddreuzy • 0

France

eddreuzy • 0 wrote:

Thank you very much for that detailed answer,

I did see the existence of the --noytrim option in the command line Lastz and could not find it on Galaxy..

I will try to find another way around by reducing the query coverage as you advised,

ADD COMMENT • link written 3.4 years ago by eddreuzy • 0

3.4 years ago by

Bob Harris • 190

United States

Bob Harris • 190 wrote:

Another thought. You could try appending some unique sequence to both the reference and the queries before alignment. For example, appending "ACGTACGTACGT" as a prefix should make the alignment extend past any error near the left end of the real sequence.

I just tried that (command line), using the older lastz params, and also removing the coverage option. It reported all 8 of your example reads, but the right end of the reference was not always included in the alignment. The --noytrim option would be needed to get that end.

Not a particularly great solution, but it might be workable.

ADD COMMENT • link written 3.4 years ago by Bob Harris • 190

Please log in to add an answer.

Similar posts • Search »