Duplicate seq ids in uniref50

5 months ago by

keren.maor • 40 wrote:

Dear all, I was trying to build a protein database with the Galaxy tool NCBI BLAST+ makeblastdb from a fasta file of UniRef50 (downloaded here: http://www.uniprot.org/downloads), so that I could run blastp against it with the TransDecoder output of my de novo transcriptome assembly. I received this error message: "30757757 sequences Fatal error: Exit code 1 () BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:2599973". Does anybody know how I can fix this in Galaxy? What should I do? Thanks in advance.

rna-seq uniref50 blast galaxy • 338 views

modified 5 months ago • written 5 months ago by keren.maor • 40

5 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Duplicated fasta IDs seem unlikely from this source.

I used the tool NormalizeFasta on the downloaded fasta file to strip the extra annotation from the title lines (it can cause problems with tools). Set the options to wrap the sequences at 80 characters per line and to remove the title line content (">" lines) after the first whitespace. This leaves just the fasta IDs.
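For readers who want to see what that normalization does to a record, here is a minimal Python sketch. It is not the NormalizeFasta tool's own code; the function name `normalize_fasta`, the `width` parameter, and the example record are made up for illustration. It keeps only the text before the first whitespace on each title line and re-wraps the sequence at a fixed width.

```python
import io


def normalize_fasta(lines, width=80):
    """Yield FASTA lines with titles cut at the first whitespace
    and sequences re-wrapped to `width` characters per line."""
    seq_parts = []  # sequence chunks of the record currently being read

    def flush():
        # Emit the buffered sequence in fixed-width slices, then reset.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            seq_parts.append(line)
    yield from flush()  # finish the last record


# Hypothetical record, wrapped at 4 characters only to keep the demo short.
example = ">UniRef50_X00001 Example protein n=1\nMKV\nLLA\n"
print(list(normalize_fasta(io.StringIO(example), width=4)))
# prints ['>UniRef50_X00001', 'MKVL', 'LA']
```

For a real file, the same function can be fed an open file handle and its output written line by line to a new file. In Galaxy itself, the NormalizeFasta tool remains the way to do this.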

I'm testing the makeblastdb tool on the normalized file to see what happens next. More feedback once that completes. The data is large, so it will take some time to process. Meanwhile, you could also try the same (normalize first, then run the tool).

Thanks and I'll follow up soon, Jen, Galaxy team

written 5 months ago by Jennifer Hillman Jackson ♦ 25k

5 months ago by

keren.maor • 40 wrote:

Thank you, Jen, for your reply. I will try it as well. Did you mean to set the option "Truncate sequence names at first whitespace" to Yes?

written 5 months ago by keren.maor • 40

Correct, use that option.

This is true when using most fasta datasets in Galaxy (and frankly, also when working on the command line - some tools are pickier about format than others). This FAQ is for custom genomes, but has good general fasta formatting advice: https://galaxyproject.org/learn/custom-genomes/

Galaxy FAQs: https://galaxyproject.org/support/
Galaxy Tutorials: https://galaxyproject.org/learn/

written 5 months ago by Jennifer Hillman Jackson ♦ 25k

I tried the tool NormalizeFasta (basically cutting off the UniRef annotation so that only the ID code is left) and then ran makeblastdb, but I got the same error message again: "30757757 sequences Fatal error: Exit code 1 () BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:2599973". Do you have any other idea what I should try? Thank you again.

written 5 months ago by keren.maor • 40
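Since the error persisted after normalization, one way to confirm or rule out genuinely repeated identifiers in the fasta file is to count them directly. The sketch below assumes the file can be processed outside Galaxy with Python. The function name `duplicate_ids` and the sample records are made up for illustration, and the check says nothing about how makeblastdb assigns its internal GNL|BL_ORD_ID values.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return {identifier: count} for every fasta ID seen more than once.

    The ID is taken as the text between '>' and the first whitespace.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            counts[fields[0] if fields else ""] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Hypothetical records: "A" appears twice with different descriptions.
sample = [">A first description\n", "MKV\n", ">B\n", "LLA\n", ">A other description\n", "MKV\n"]
print(duplicate_ids(sample))
# prints {'A': 2}
```

With a real file, pass an open file handle instead of the list. All identifiers are held in memory, so a file with about 30 million records may need several gigabytes of RAM. An empty result would suggest the identifiers themselves are unique and that the cause lies elsewhere.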
