Convert FASTA file into plain text whole genome sequence


3.0 years ago by

Malaysia

I have a FASTA file containing millions of sequences and I want a simple script to convert this file into one long sequence, i.e., delete all headers and remove any spaces and line breaks. I can always add a ">seq_name" to the first line afterwards, so maintaining the top header is not necessary.
I've searched the forums but can only find scripts that do the reverse. I'm using millions of reads as a substitute for a complete genome, and my current pipeline cannot reconcile this, so I want to trick it into thinking that this is one long genome sequence.
Thanks for any help!!!

Sample of my input FASTA file:

>lcl|NC_003198.1_gene_1 [locus_tag=Salmonella Typhi0001] [location=190..255]

AAAAGCNGGTTATGTTGTCGCTTTACGGTTTTCATTCAGGACGCGCTATGGGCAATAAGTATTCCGGCCTGCAAATTGGTATTCACTGGTTAGTCTTTT
>lcl|NC_003198.1_gene_14 [locus_tag=Salmonella Typhi] [location=15020..15967]
TATCGCGNCGTTTTTACGCTGGCGTCACCGTCACCAATAAACCTTAGCGCGCTGGAGGAAATATCCCAGCGCGAAATTTATCGCCCCATAAACCGCGCC

My output FASTA file should be formatted as follows:

>|SalmonellaTyphi|TAAAAGCNGGTTATGTTGTCGCTTTACGGTTTTCATTCAGGACGCGCTATGGGCAATAAGTATTCCGGCCTGCAAATTGGTATTCACTGGTTAGTCTTTTTATCGCGNCGTTTTTACGCTGGCGTCACCGTCACCAATAAACCTTAGCGCGCTGGAGGAAATATCCCAGCGCGAAATTTATCGCCCCATAAACCGCGCCC


modified 3.0 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.0 years ago by meeran_micro • 0

3.0 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

This functionality does not exist yet. A ticket has been created to make the enhancement request: https://github.com/galaxyproject/tools-devteam/issues/305. Please feel free to add comments to the ticket to clarify any specific goals you may have that are not currently covered.

There are other tools in Galaxy that will merge data in tabular format (fasta could be converted to tabular to use them), but these do not act on dataset collections. Instead, individual datasets are specified. This would be impractical with a large number of sequence datasets (probably anything over 10 in the user interface). An API script could be developed to input multiple datasets, but if the number of datasets is very large (1 M is very large), this again becomes impractical.
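To illustrate what "converted to tabular" means here, the following is a minimal Python sketch, not the Galaxy conversion tool itself. The function name `fasta_to_tabular` and the two-column layout (title, then sequence) are assumptions made for illustration.

```python
def fasta_to_tabular(lines):
    """Yield one 'title<TAB>sequence' row per FASTA record.

    Hypothetical helper: the column layout is an assumption,
    not the output format of any specific Galaxy tool.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header + "\t" + "".join(chunks)
```

Each record becomes a single row, which is the shape the tabular merge tools expect.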

A tool that acts on one fasta file or a collection of fasta files is the best solution, and that is what the enhancement request above addresses. I suggested a tool in the ticket, but others may exist, and writing a tool to do this work is probably not difficult. If you wish to create a wrapper and/or tool for this function, it would be a welcome addition to the Tool Shed.

Thanks! Jen, Galaxy team

written 3.0 years ago by Jennifer Hillman Jackson ♦ 25k

Update: I missed a tool when I was looking for a solution. jmchilton referenced it in the ticket I created.

Look for it in the Tool Shed: https://toolshed.g2.bx.psu.edu/repository?repository_id=1648599b6784efd6&changeset_revision=2904d46167da

Tool name: fasta_merge_files_and_filter_unique_sequences

Use it with this tool to create the dataset collection: splitfasta (also in the Tool Shed)

These tools would be for use in a local or cloud Galaxy at this time, but I asked to see if both could be added to http://usegalaxy.org in the future.

If you decide to try these, feedback about how they perform with such a large number of items in the dataset collection would be useful to our team.

Thanks! Jen

written 3.0 years ago by Jennifer Hillman Jackson ♦ 25k

3.0 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Another solution that requires no new tools (via Bjorn Gruening). This was posted on GitHub, but is worth sharing here directly.

Exact method:

Use Select twice on the input fasta dataset
- NOT MATCHING ">" (no quotes)
- NOT MATCHING "^$" (no quotes; this extra pass removes any empty lines)
Create a new title line using the Upload tool and type/paste in the content
Concatenate the new title line with the sequence data, placing the title line on top
Run FASTA Width formatter on the resulting dataset (60 is a good width choice, but widths between 40 and 80 are accepted by most tools)
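For anyone working outside Galaxy, the same steps can be sketched as a short Python script. This is only an illustration under stated assumptions, not what the Galaxy tools run internally: the function name `merge_fasta`, the default title, and the file names in the usage comment are hypothetical, and the whole sequence is held in memory.

```python
def merge_fasta(lines, title="merged_sequence", width=60):
    """Join every sequence line of a multi-FASTA into one wrapped record.

    Hypothetical helper mirroring the Galaxy steps above.
    """
    # Select NOT MATCHING ">" and "^$": drop title lines and empty lines.
    # (Here only lines that start with ">" are treated as title lines.)
    parts = []
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith(">"):
            parts.append(line)
    sequence = "".join(parts)

    # FASTA Width formatter: re-wrap the sequence at a fixed width.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]

    # Concatenate: the new title line goes on top of the sequence data.
    return "\n".join([">" + title] + wrapped) + "\n"


# Usage (file names are placeholders):
# with open("reads.fasta") as src, open("merged.fasta", "w") as dst:
#     dst.write(merge_fasta(src, title="SalmonellaTyphi"))
```

Reads are joined in file order, so the result is a concatenation rather than an assembly.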

Best, Jen, Galaxy team

written 3.0 years ago by Jennifer Hillman Jackson ♦ 25k
