Problem with fastx collapser

17 months ago by

dac330 • 0

dac330 • 0 wrote:

Hello:

I am having a problem working with fastx_collapser in Galaxy.

When attempting to use the Collapse sequences tool, I get the following error message: "fastx_collapser: Invalid input: This looks like a multi-line FASTA file. Line 899 contains a nucleotides string instead of a '>' prefix. FASTX-Toolkit can't handle multi-line FASTA files. Please use the FASTA-Formatter tool to convert this file"

To remedy this, I used the FASTA Width formatter (Galaxy Version 1.0.0) to ensure that every odd line is a sequence identifier and every even line is a nucleotide sequence. However, I still get the same error message as above.

I've exported the problem fasta file and attempted to view line 899, yet it looks normal to me (it is the sequence identifier, as expected).
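
A line can look normal on screen and still contain bytes that a viewer does not display. One way to check is to print the raw representation of the reported line and its neighbours. The following is a minimal sketch, not part of Galaxy or FASTX-Toolkit; the function name `show_lines` and its parameters are illustrative assumptions.

```python
def show_lines(raw_lines, line_number, context=1):
    """Return (line number, repr) pairs around the requested 1-based line.

    raw_lines is any iterable of bytes objects, for example a file opened
    in binary mode, so carriage returns, tabs and other bytes stay untouched.
    """
    shown = []
    for number, raw in enumerate(raw_lines, start=1):
        if number > line_number + context:
            break  # nothing further is needed
        if number >= line_number - context:
            # repr() makes invisible bytes visible as escape sequences
            shown.append((number, repr(raw)))
    return shown


# Usage on an exported file (the file name is hypothetical):
# with open("exported.fasta", "rb") as handle:
#     for number, text in show_lines(handle, 899):
#         print(number, text)
```

Reading in binary mode splits lines on the newline byte only, so a stray carriage return shows up inside the printed line instead of being silently treated as a line break.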

I am out of ideas as to what this error message could mean otherwise. Any suggestions would be greatly appreciated.

modified 11 weeks ago by ariverosw • 0 • written 17 months ago by dac330 • 0

17 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

There is still a format problem within the dataset, the nucleotide strings may be very long, or there is some other issue (potentially with the sequence identifier, or another formatting problem such as extra spaces, tabs, or lines). Please be aware that the tool is designed to work with NGS reads, not long full/partially assembled transcripts.
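
To see whether very long nucleotide strings are present, the lengths of the sequence lines can be listed. This is a minimal sketch; the function name and the `top` parameter are illustrative assumptions, and because this thread does not state what length the tool can handle, the sketch only reports lengths without judging them.

```python
def longest_sequence_lines(lines, top=5):
    """Return the `top` longest sequence lines as (length, line number, record id)."""
    lengths = []
    current_id = None
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith(">"):
            # The record id is the first word after the '>' prefix
            fields = text[1:].split()
            current_id = fields[0] if fields else ""
        elif text:
            lengths.append((len(text), number, current_id))
    # Sort by length only, longest first
    return sorted(lengths, key=lambda item: item[0], reverse=True)[:top]


# Usage (the file name is hypothetical):
# with open("exported.fasta") as handle:
#     for length, number, record_id in longest_sequence_lines(handle):
#         print(length, number, record_id)
```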

If you are working at http://usegalaxy.org, or can reproduce the problem there, you can send in a bug report for feedback to confirm the data issue and possibly get help with reformatting that will allow the tool to execute correctly. Please leave the associated datasets undeleted and include a link to this post so we can associate the two. If a bug in the tool is uncovered, we will also want to characterize and fix it, and a bug report is the best way to share how to reproduce the problem.

https://galaxyproject.org/issues/

Thanks! Jen, Galaxy team

ADD COMMENT • link written 17 months ago by Jennifer Hillman Jackson ♦ 25k

11 weeks ago by

ariverosw • 0

ariverosw • 0 wrote:

Hello,

I'm having a similar issue on a different line of the fasta file. I was doing this in Galaxy and noticed that the error message in the output came from fastx-toolkit. So I went to the HPC and tried this command:

fastx_collapser -v -i test.fa -o test2.fa

And I got the same error message I got in Galaxy:

fastx_collapser: Invalid input: This looks like a multi-line FASTA file. Line 3779 contains a nucleotides string instead of a '>' prefix. FASTX-Toolkit can't handle multi-line FASTA files. Please use the FASTA-Formatter tool to convert this file into a single-line FASTA.

This shows the issue is not specific to Galaxy. I tried multiple things and couldn't fix it. The fasta file appears to be good, but the collapser keeps failing.

Originally I was doing a Falcon assembly on this fasta file and it was crashing, which is why I went to check the input file. I hope I can get some comments and suggestions on what to try next in this situation.

Thanks!!!

Alejandro

ADD COMMENT • link written 11 weeks ago by ariverosw • 0

The fasta input to this tool should have all of the sequence content on one line (be "unwrapped") to avoid this specific error.

Two tools can unwrap the sequence lines: NormalizeFasta and FASTA Width formatter.

This is different formatting from a Custom genome/transcriptome, which works best with tools when formatted with lines wrapped at 80 bases (mappers won't care, but downstream tools will).
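
The two layouts described above, unwrapped (one sequence line per record) and wrapped at 80 bases, can be sketched as follows. This is an illustration only and not the implementation of NormalizeFasta or FASTA Width formatter; both function names are assumptions made for the example.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        # strip() also drops carriage returns and stray edge whitespace
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if text.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = text, []
        elif header is not None:
            chunks.append(text)
        # sequence text before the first header is ignored in this sketch
    if header is not None:
        yield header, "".join(chunks)


def wrap_sequence(sequence, width=80):
    """Split one sequence string into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

Writing each pair from `unwrap_fasta` as two lines gives the unwrapped layout; passing the sequence through `wrap_sequence` first gives the 80-base layout.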

ADD REPLY • link modified 11 weeks ago • written 11 weeks ago by Jennifer Hillman Jackson ♦ 25k

Thanks for the reply. Yes, I know that part, and all the sequences are on a single line, but I am still having issues. When I remove both the sequence ID line and the sequence itself, there is no problem, yet each of them appears as one line on the screen. I don't know if there are hidden characters (like \r or others) that might be creating this issue ...
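
Hidden characters such as \r can be looked for directly by scanning the raw bytes of every line. A minimal sketch, assuming that anything outside printable ASCII is worth flagging; the function name is illustrative, and tabs are flagged too even though they may be intentional in some headers.

```python
def find_hidden_bytes(raw_lines):
    """Return (line number, sorted unexpected byte values) for each affected line."""
    findings = []
    for number, raw in enumerate(raw_lines, start=1):
        # Drop only the terminating newline; everything else is inspected
        body = raw[:-1] if raw.endswith(b"\n") else raw
        # Bytes outside printable ASCII (0x20 to 0x7E) are flagged, which
        # covers carriage returns, tabs, NUL bytes and non-ASCII bytes
        odd = sorted({byte for byte in body if byte < 0x20 or byte > 0x7E})
        if odd:
            findings.append((number, odd))
    return findings


# Usage (the file name is hypothetical):
# with open("test.fa", "rb") as handle:
#     for number, values in find_hidden_bytes(handle):
#         print(number, [hex(value) for value in values])
```

A value of 13 (0x0d) in the result is a carriage return and 9 (0x09) is a tab.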

ADD REPLY • link written 11 weeks ago by ariverosw • 0

Hidden characters could definitely be a factor, as could other formatting issues (extra blank lines, IUPAC characters, etc.). There are no wrapped tools at Galaxy Main to fully validate or repair nucleotide fasta formatting at this level of detail (yet; however, any tool could be wrapped going forward). If you find one you like, it could be wrapped by you or a colleague, or a request could be made to the Galaxy development community/IUC.

There is a tool for validating protein fasta -- for use in your own Galaxy or at the Galaxy-P public server, but that won't help in your case: https://toolshed.g2.bx.psu.edu/view/galaxyp/validate_fasta_database/48c2271171f2

Try a web search for "fasta validator" to see what is available open-source for command-line usage now. Or you can use your own command-line tools/methods. The idea is to get the basic formatting correct first, then upload the data into Galaxy.
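
As an example of a do-it-yourself check before uploading, the sketch below looks for the formatting issues named in this thread: blank lines, stray whitespace, sequences spread over several lines, and unexpected characters. It is a minimal sketch and not a full validator. The function name and the default `alphabet` of ACGTN are assumptions; which characters the collapser actually accepts is not stated in this thread.

```python
def check_single_line_fasta(lines, alphabet="ACGTN"):
    """Return a list of (line number, message) problems; empty if none found."""
    problems = []
    allowed = set(alphabet)
    expect_header = True
    number = 0
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.strip() == "":
            problems.append((number, "blank line"))
            continue
        if text != text.strip():
            # Includes a trailing carriage return from Windows line endings
            problems.append((number, "leading or trailing whitespace"))
            text = text.strip()
        is_header = text.startswith(">")
        if is_header != expect_header:
            wanted = "header" if expect_header else "sequence"
            problems.append((number, "expected a " + wanted + " line"))
        if not is_header:
            # Lowercase bases are accepted; anything outside the alphabet is not
            extra = sorted(set(text.upper()) - allowed)
            if extra:
                problems.append((number, "unexpected characters: " + "".join(extra)))
        # Headers and sequence lines must strictly alternate
        expect_header = not is_header
    if number and not expect_header:
        problems.append((number, "header without a sequence"))
    return problems
```

An empty result only means that none of these particular checks failed; it does not guarantee that a given tool will accept the file.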

ADD REPLY • link modified 11 weeks ago • written 11 weeks ago by Jennifer Hillman Jackson ♦ 25k
