I have a FASTA file containing millions of sequences, and I want a simple script that converts it into one long sequence, i.e., deletes all headers and removes any spaces and line breaks. I can always add a ">seq_name" header to the first line afterwards, so keeping the top header is not necessary.
I've searched the forums but can only find scripts that do the reverse. I'm using millions of reads as a substitute for a complete genome, and my current pipeline cannot handle that, so I want to trick it into thinking the reads are one long genome sequence.
Thanks for any help!!!
Sample of my input FASTA file:
>lcl|NC_003198.1_gene_1 [locus_tag=Salmonella Typhi0001] [location=190..255]
AAAAGCNGGTTATGTTGTCGCTTTACGGTTTTCATTCAGGACGCGCTATGGGCAATAAGTATTCCGGCCTGCAAATTGGTATTCACTGGTTAGTCTTTT
>lcl|NC_003198.1_gene_14 [locus_tag=Salmonella Typhi] [location=15020..15967]
TATCGCGNCGTTTTTACGCTGGCGTCACCGTCACCAATAAACCTTAGCGCGCTGGAGGAAATATCCCAGCGCGAAATTTATCGCCCCATAAACCGCGCC
My output FASTA file should be formatted as follows:
>|SalmonellaTyphi|
TAAAAGCNGGTTATGTTGTCGCTTTACGGTTTTCATTCAGGACGCGCTATGGGCAATAAGTATTCCGGCCTGCAAATTGGTATTCACTGGTTAGTCTTTTTATCGCGNCGTTTTTACGCTGGCGTCACCGTCACCAATAAACCTTAGCGCGCTGGAGGAAATATCCCAGCGCGAAATTTATCGCCCCATAAACCGCGCCC
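
For reference, the conversion described above could be sketched in Python roughly as follows. This is only a minimal sketch, assuming the input really is FASTA (every header line starts with ">", as in the sample) and not four-line FASTQ records, which would need different handling. The function names and the default record name are made up for illustration. The file is read line by line, so millions of reads never have to sit in memory at once.

```python
def iter_sequence_chunks(lines):
    """Yield sequence text with header lines dropped and all whitespace removed."""
    for line in lines:
        if line.startswith(">"):
            # Header line: skip it entirely.
            continue
        # split() with no arguments splits on any whitespace, including
        # spaces, tabs, and trailing "\n" or "\r\n".
        chunk = "".join(line.split())
        if chunk:
            yield chunk


def merge_records(lines, name="seq_name"):
    """Yield the pieces of a single-record FASTA: one header, one sequence line."""
    yield ">" + name + "\n"
    for chunk in iter_sequence_chunks(lines):
        yield chunk
    yield "\n"


def write_single_record(in_path, out_path, name="seq_name"):
    """Stream in_path to out_path as one long sequence under a single header."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for piece in merge_records(src, name):
            dst.write(piece)
```

With hypothetical file names, a call such as write_single_record("reads.fasta", "merged.fasta", name="SalmonellaTyphi") would write the header on the first line and the whole concatenated sequence on the second. Reads are joined in file order with nothing between them, so the junctions between reads are artificial and any tool that looks across them will see sequence that does not exist in the organism.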