Question: Duplicate seq ids in uniref50
keren.maor40 wrote:

Dear all, I was trying to build a protein database with the Galaxy tool NCBI BLAST+ makeblastdb from a FASTA file of UniRef50 (downloaded here: http://www.uniprot.org/downloads), so that I could run blastp with the TransDecoder output of my de novo transcriptome assembly. I received this error message: "30757757 sequences Fatal error: Exit code 1 () BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:2599973". Does anybody know how I can fix this in Galaxy? What should I do? Thanks in advance.

Tags: rna-seq, uniref50, blast, galaxy
modified 5 months ago • written 5 months ago by keren.maor40
Jennifer Hillman Jackson25k wrote:

Hello,

Duplicated fasta IDs seem unlikely from this source.

I used the tool NormalizeFasta on the downloaded fasta file to strip off the extra annotation on the title line (which can cause problems with some tools). The options should be set to wrap the sequences at 80 bases and to remove title line content (">" lines) after the first whitespace. This results in just the fasta IDs being retained.
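For anyone working outside Galaxy, the same normalization can be sketched in a few lines of Python. This is only an illustration of the two settings described above (truncate the title line at the first whitespace, re-wrap the sequence at 80 characters); it is not the NormalizeFasta implementation, and the function name `normalize_fasta` is made up for this example.

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with titles cut at the first whitespace
    and sequences re-wrapped to `width` characters per line."""
    seq_chunks = []

    def flush():
        # Emit the sequence collected so far, re-wrapped to `width`.
        seq = "".join(seq_chunks)
        seq_chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield from flush()
            # Keep only the ID, e.g. ">UniRef50_XXXX description" -> ">UniRef50_XXXX"
            yield line.split(None, 1)[0]
        else:
            seq_chunks.append(line)
    yield from flush()
```

To use it on a file, iterate over an open file handle and write each yielded line followed by a newline to the output. The input is streamed one record at a time, so memory use stays small even for a file the size of UniRef50.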

I'm testing the makeblastdb tool on that to see what happens next. More feedback once completed. The data is large, so it will take some time to process. Meanwhile, you could also try to do the same (normalize first, then run the tool).

Thanks and I'll follow up soon, Jen, Galaxy team

written 5 months ago by Jennifer Hillman Jackson25k
keren.maor40 wrote:

Thank you, Jen, for your reply. I will try it as well. Did you mean to set the option "Truncate sequence names at first whitespace" to Yes?

written 5 months ago by keren.maor40

Correct, use that option.

This is true when using most fasta datasets in Galaxy (and frankly, also when working on the command line; some tools are pickier about format than others). This FAQ is for custom genomes, but has good general fasta formatting advice: https://galaxyproject.org/learn/custom-genomes/

written 5 months ago by Jennifer Hillman Jackson25k

I tried the NormalizeFasta tool (which basically cut the UniRef annotation down to just the ID code), and then makeblastdb, but I got the same error message again: "30757757 sequences Fatal error: Exit code 1 () BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:2599973". Do you have any other ideas for what I should try? Thank you again.
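One way to test whether the FASTA file really contains repeated identifiers is to count them directly. The sketch below is a hypothetical helper (the name `duplicate_ids` is not from any tool in this thread) that treats the text up to the first whitespace of each ">" line as the ID. It only checks the file itself and says nothing about how makeblastdb assigns its own internal IDs.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return {id: count} for every FASTA ID that occurs more than once.

    The ID is taken as the text between '>' and the first whitespace.
    """
    counts = Counter(
        line[1:].split(None, 1)[0]
        for line in lines
        # Only title lines that actually carry an ID after the '>'
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

It can be run on an open file handle, for example `with open("uniref50.fasta") as fh: print(duplicate_ids(fh))`, where the filename is a placeholder. An empty result would suggest the identifiers in the file are unique and the error has another cause. Note that the counter holds every ID in memory, which is noticeable for a file with about 30 million sequences.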

written 5 months ago by keren.maor40