Dear all,

I was trying to build a protein database with the Galaxy tool NCBI BLAST+ makeblastdb, using a FASTA file of UniRef50 (downloaded here: http://www.uniprot.org/downloads). The goal is to run BLASTP with the TransDecoder output from my de novo transcriptome assembly against it. I received this error message:

"30757757 sequences Fatal error: Exit code 1 () BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:2599973"

Does anybody know how I can fix this in Galaxy? What should I do?

Thanks in advance
Duplicated FASTA IDs seem unlikely from this source.
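If you want to rule duplicates out directly, a small script along these lines could count repeated IDs outside of Galaxy. This is only a sketch: the function name find_duplicate_ids and the file name in the usage comment are placeholders, not part of any Galaxy tool, and it assumes the ID is everything between ">" and the first whitespace.

```python
from collections import Counter


def find_duplicate_ids(lines):
    """Return {id: count} for FASTA IDs that occur more than once.

    Assumption: the ID is the text between '>' and the first whitespace.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            counts[seq_id] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Hypothetical usage (the file name is a placeholder):
# with open("uniref50.fasta") as handle:
#     print(find_duplicate_ids(handle))
```

An empty result means no ID is repeated. Note that every ID is held in memory, so on a file with about 30 million sequences this may need several gigabytes of RAM.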
I ran the NormalizeFasta tool on the downloaded FASTA file to strip the extra annotation from the title lines (it can cause problems with some tools). Set the options to wrap the sequences at 80 bases and to remove title line content (">" lines) after the first whitespace, so that only the FASTA IDs are retained.
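For anyone working outside Galaxy, the same two steps can be approximated in a few lines of Python. This is a rough sketch of the idea, not the NormalizeFasta implementation: normalize_fasta is a made-up name, and it simply cuts each title line at the first whitespace and re-wraps the sequence at a fixed width (80 characters by default).

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with title lines cut at the first whitespace
    and sequences re-wrapped to `width` characters per line."""
    seq_parts = []

    def flush():
        # Emit the sequence collected so far in fixed-width chunks.
        sequence = "".join(seq_parts)
        seq_parts.clear()
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield from flush()
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            seq_parts.append(line)
    yield from flush()


# Hypothetical usage (both file names are placeholders):
# with open("uniref50.fasta") as src, open("uniref50.norm.fasta", "w") as dst:
#     for out_line in normalize_fasta(src):
#         dst.write(out_line + "\n")
```

Because it is a generator, only one sequence is held in memory at a time, which matters for a file of this size.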
I'm now testing the makeblastdb tool on the normalized file to see what happens, and will post more feedback once it completes. The dataset is large, so it will take some time to process. Meanwhile, you could try the same thing (normalize first, then run the tool).
Thanks, and I'll follow up soon.

Jen, Galaxy team