Megablast Output

6.6 years ago by

Peter Cock • 1.4k

European Union

Peter Cock • 1.4k wrote:

I've not tried to track down this reported possible bug in GI numbers, and weather it also affects BLAST+ as well as the legacy NCBI BLAST (which has now been discontinued). Do you have a specific example. As to the 12 columns, they are standard BLAST tabular output, and should match the defaults in BLAST+ tabular output which are: Column NCBI name Description 1 qseqid Query Seq-id (ID of your sequence) 2 sseqid Subject Seq-id (ID of the database hit) 3 pident Percentage of identical matches 4 length Alignment length 5 mismatch Number of mismatches 6 gapopen Number of gap openings 7 qstart Start of alignment in query 8 qend End of alignment in query 9 sstart Start of alignment in subject (database hit) 10 send End of alignment in subject (database hit) 11 evalue Expectation value (E-value) 12 bitscore Bit score Peter

ADD COMMENT • link written 6.6 years ago by Peter Cock • 1.4k

Peter, you requested an example, here are the first five hits for my first query sequence (OTU#0) 0 324034994 527 93.23 266 13 5 1 265 22 283 7e-102 379.0 0 56181650 513 93.26 267 10 8 1 265 25 285 7e-102 379.0 0 314913953 582 91.79 268 13 9 1 265 24 285 2e-92 347.0 0 305670062 281 92.52 254 14 5 4 256 32 281 2e-92 347.0 0 310814066 1180 91.73 266 14 7 1 265 24 282 9e-92 345.0 You will notice there are 13 columns, one in addition to the 12 column titles you explained. This is because there is a column between sseqID and pident. In the metagenomic tutorial the first 4 columns are explained, and column 3 is described as length of sequence in database (or length of the subject sequence). This is the problem column. The length of only one of the subject GI numbers above match the subject length in NCBI. This has caused me to wonder if I can trust the hit info. In all cases that I've checked, when this happens the correct match is the listed GI value minus 1 (ie, in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt long).

ADD REPLY • link written 6.6 years ago by Sarah Hicks • 60

I see now - the megablast_wrapper.py calls megablast (from the old legacy NCBI blast suite) which does indeed produce 12 column tabular output. But the wrapper script then edits the output: It appears to be splitting column 2 in two at the underscore intended to give the match ID and the length. This puzzles me but I haven't used the legacy BLAST tabular output for a while. On BLAST+ you can ask for the query or subject length explicitly as their own columns so we don't have this problem. The megablast_wrapper.py also re-formats the floating point score in the last column, apparently the NCBI style could cause problems with the Galaxy filter tool. That is strange. Peter

ADD REPLY • link written 6.6 years ago by Peter Cock • 1.4k

So, Jen, I'm not sure if we're talking about the same ID change... I am under the impression that GenBank does not change it's GI numbers for it's entries. Plus, it's now looking like all sequence length output info for each hit through Galaxy's megablast does not match to the GI number output given by Galaxy megablast, but to the GI number before it. Because the "-1" rule is so consistent, it makes this seem less and less like it has to do with NCBI changing it's GI numbers to make room for new entries or something. In other words, there is a shift, as if a 1 was added to each NCBI GI number in galaxy before galaxy produces the output file. I need someone to tell me if I can trust the output. Basically I see it this way. Every hit row from Galaxy megablast actually has information for two NCBI entries: the one that shares the GI output and the one before it that shares the sequence length output. Which one is the hit I should be using? Because on some occations, the NCBI entry that shares the GI output from galaxy is VERY distantly related to the NCBI entry that shares the subject sequence length output from galaxy, and I don't know which to pick. Is this problem well understood, yet?

ADD REPLY • link written 6.6 years ago by Sarah Hicks • 60

Hi Sarah, We appreciate all of the information you have provided and have been working here since yesterday to investigate the issue in more detail. This includes incorporating the additional data both you and Peter have been posting. We don't have anything conclusive to report yet, but it would have been considerate to send an update this morning to let you know what we were doing. Please accept my apologies for not doing so - we are in fact in complete agreement that as the data currently presents, something odd appears to be going on. Genbank updates would be unrelated as gi numbers do not change through time (although they can be retired, but again, not related to this case). The question of the mismatch in the wrapped Megablast output between gi and reported length is the open issue to be addressed. A reply will be send as soon as the root cause is determined. If there is indeed a problem, this would of course be considered a priority to correct. Not that we are expecting delays, but if your analysis is very urgent, using the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud instance, would be the best immediate remedy (this version has the standard 12 column output). Sequence length data could always be obtained from Genbank and added into these results using other Galaxy tools (column join, etc.). Thank you and Peter both for the help and for your patience! Best, Jen Galaxy team -- Jennifer Jackson http://galaxyproject.org

ADD REPLY • link written 6.6 years ago by Jennifer Hillman Jackson ♦ 25k

Thanks for the update! I am not in a huge rush, but I was worried I had not been describing the problem clearly, as it took quite a while to explain to colleagues here :) best

ADD REPLY • link written 6.6 years ago by Sarah Hicks • 60

Getting the query and match sequence lengths is even simpler that that with the BLAST+ wrappers - just select the "extended" tabular output. Of course, you'll need to adjust the downstream analysis to take into account the different column numbers. Peter

ADD REPLY • link written 6.6 years ago by Peter Cock • 1.4k

Thanks Peter, Excellent point. From there, the "Cut" tool could be used to reorganize the output to exactly match that of the 13-column regular megablast output. So, no external data needed, no tool modifications needed. This can't be done on the main public Galaxy instance as BLAST+ is not available there, but for any local/cloud instance this is an alternative certainly work testing. Best, Jen Galaxy team -- Jennifer Jackson http://galaxyproject.org

ADD REPLY • link written 6.6 years ago by Jennifer Hillman Jackson ♦ 25k

Ok, I'll work on installing this...

ADD REPLY • link written 6.6 years ago by Sarah Hicks • 60

I see Anton has just updated the Galaxy megablast tool to use BLAST+ internally now, getting blastn to produce the specific 13 columns desired. https://bitbucket.org/galaxy/galaxy-central/changeset/0b5cb60e4810 So once that change goes live it will probably resolve this issue. Peter

ADD REPLY • link written 6.6 years ago by Peter Cock • 1.4k

Dear all, Sometimes ago, Ive reported on this list the same problem with megablast than Sarah mentioned. I finally used another way to analyse my data but my conclusion was similar to Sarah one with most of the time a shift of Ť -1 ť between the GI number in the output and the following parameters in the other columns. One of the possible explanations was maybe a change in the GI number according to the version of the databank used. However, Im not so sure that this explanation is the right one and Im not so sure about the consistency of this Ť GI-1 ť rule. Indeed, Ive made a simple test by using a known and well identified sequence taken at random in Genbank (GI 2924630 in NCBI, AB002412.1, Elephas maximus mitochondrial DNA for cytochrome b, 1137 bp, so called TEST-SEQ below). Using this sequence as template for megablast in Galaxy gave the following results (3 first lines): TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0 TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0 TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0 As you can see, the parameters concerning the first match seem correct : 1137 bp and 100% of identity. However, the GI number is not the good one (2924736 instead of 2924630) and we dont have a Ť GI-1 ť rule here. I used the Fetch Taxonomic Representation command available in Galaxy to fetch the taxonomy for the 3 sequences above and 2 out 3 GI numbers correspond to very distant taxa I did a megablast on the NCBI (so, the database is more recent) to compare and here are the results : TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100 TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095 TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084 At this step, I would say that the table given in output of the megablast in Galaxy is good for all parameters except for the GI of the database hit (column 2)and I'm not able to say why. Im not sure that it can help Hoping that you will be able to clarify this possible problem in the megablast output, Best, Sandrine 2012/4/24 Sarah Hicks <garlicscape@gmail.com>:

ADD REPLY • link written 6.6 years ago by Sandrine Hughes • 50

6.6 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Sarah, Peter defined the columns (thanks) but I can provide some information about the GenBank identifiers. The megablast database on the public server are roughly a year old and there have been updates at NCBI since that time. As I understand it, this manifests as occasional mismatches between hits at Galaxy vs Genbank when comparing certain IDs linked to updated records. We are working to update these three databases, but there are some complicating factors around this processing specifically related to the public instance and the metagenomics workflow that have yet to be resolved. Please know that getting updated is a priority for us and we apologize for the inconvenience. To use the most current databases, a local or (better) cloud instance with either the regular or BLAST+ version of the tool and a database your choice is the recommendation. Instructions to get started are at: getgalaxy.org getgalaxy.org/cloud Hopefully this explains the data mismatch. This question has come up before, but I think you are correct in that the final conclusion never was posted back to the galaxy-user list (for different reasons). So, thank you for asking so we that could send out a clear reply for everyone using the tool. Best, Jen Galaxy team -- Jennifer Jackson http://galaxyproject.org

ADD COMMENT • link written 6.6 years ago by Jennifer Hillman Jackson ♦ 25k

Thanks so much for the prompt reply. I don't mind using last years GenBank, as long as I am getting accurate hits. I just have a couple more questions to confirm I am safe using the Galaxy pipline for this... So if I continue to work within the the 1 year old database, can I trust the output as accurate matches? Specifics about my project: I have environmental samples that were sequenced for fungal ITS. I have clustered these into OTUs, and chosen a representative sequence for each. If I retrieve hits for this representative sequence file in my sample, can I trust the hits as being the correct hits as of last year? I'm just worried about what that one person said who thought there was some column arrangement problems, because I'm finding that I'm getting hits from different phylum for the same sequence using default parameters in megablast... Can I also assume, then, that I should NOT identify my representative sequence file to updated GI numbers using another pipeline, and then bring the file of GI numbers to Galaxy to fetch taxonomic assignments? (which I would do because of the nice neat columns for each taxonomic level Galaxy puts out) Sarah

ADD REPLY • link written 6.6 years ago by Sarah Hicks • 60

Similar posts • Search »