Hello,
I am trying to analyze RNA-Seq data. The SRA accession number for the run that I am using is SRR2422919. I downloaded it into Galaxy using the NCBI SRA tool "Download and Extract Reads in FASTA/Q format from NCBI SRA". I have checked NCBI and this is a paired-end data set. However, the reads in this file are interleaved. I have been looking for a way to separate the reads into two files. I intend to use Manipulate FASTQ for this because the reads do not appear to be in a format that would allow me to use FASTQ Splitter. I have included an example of what the FASTQ file looks like below.
@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/1
CCAGCCGCAAAACCACTTCCTAGCAAATCCGTGCGCAAGGAGTCAAAAGAAGAAACCCCTGAGGTCACAAAAGTGAATCACGTGGAAAAGCCACCCAAAGTTGAAAGCAAAGAAAAGGTAATGGT
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
ABBBBGE>EGCGEGGGGGGGGGGGGGGGFGGGGEGGGGGF1FB@EFGGFGGGGFGGGGGGFFF==EEGGGGGGGGGGGGGEE>FC<fgggfgggf<fg0<<0<=;fggggggggcgc>GGDGGED
@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/2
TCACCGTCTTCTCCTTGGCAGCTTTGGGTTTGACATCTGTGGCTTGCTTCTCAGCCACCTCGGCTTTCACTGGAGATGGCTCTTCTTTGCTGGGAACCTCCTTTTGAGTCACTGAAGGTTTGGTC
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
B@BBCGGGGGGGGGGGGC>EFGGGGGGGCFGGGGCGGGGGGGGGGGGGGGGDFGGBGGGGGGFEEGF>E1EEFFG>GGGEEFGGGFGGGCGE>GB<bff@gggg09.<c<dccg;f<f0<@gg@0<
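For comparison outside Galaxy, the separation I am after can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: it assumes every record spans exactly four lines and that the first token of each header ends in /1 or /2, as in the example above. The function name split_interleaved and the toy records in the demo are hypothetical and are not taken from my data.

```python
import io


def split_interleaved(handle, out1, out2):
    """Copy /1 records to out1 and /2 records to out2; return (n1, n2).

    Assumes strict four-line FASTQ records (header, sequence, '+' line,
    quality) and headers whose first token ends in /1 or /2.
    """
    n1 = n2 = 0
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of input (or a trailing blank line)
        read_id = record[0].split()[0]
        if read_id.endswith("/1"):
            out1.writelines(record)
            n1 += 1
        elif read_id.endswith("/2"):
            out2.writelines(record)
            n2 += 1
        else:
            raise ValueError(f"header without /1 or /2 suffix: {read_id}")
    return n1, n2


# Toy interleaved input (hypothetical reads, for illustration only).
demo = io.StringIO(
    "@read1/1\nACGT\n+\nIIII\n"
    "@read1/2\nTGCA\n+\nIIII\n"
)
forward, reverse = io.StringIO(), io.StringIO()
print(split_interleaved(demo, forward, reverse))
```

With real files, the three handles would be opened with open() instead of io.StringIO. The sketch does not check that the two mates of each pair are adjacent or that both are present.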
First, I tried running Manipulate FASTQ with the pattern .+/2 to create a file for the forward reads. This resulted in an error related to how long the job had run. Realizing that there must be a mistake in my procedure, I searched for how others were handling the problem of separating interleaved FASTQ files. I am now trying to follow the instructions provided by the Galaxy Community Hub for the full process. I noticed that the sequence identifier on the quality score header line (the line beginning with +) did not meet either of the accepted criteria. This being the case, I tried the following regular expressions in the Replace Text in entire line tool: ^\+SRR.+, ^\+SRR242291.+, \+SRR.+, and ^\+SRR2422919\.\d+. None of these regular expressions removed +SRR2422919.1. As far as I can tell, the output file is exactly the same as the input.
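As a sanity check outside Galaxy, the four patterns can be tried against the + line from the example using Python's re module. This is a hedged sketch: the regular expression engine behind the Galaxy tool may behave differently from Python's, and replacing the match with a bare + is my assumption about the desired result.

```python
import re

# The '+' line copied from the example record above.
plus_line = "+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125"

# The four patterns tried in the Replace Text tool.
patterns = [
    r"^\+SRR.+",
    r"^\+SRR242291.+",
    r"\+SRR.+",
    r"^\+SRR2422919\.\d+",
]

# Replace whatever each pattern matches with a bare '+'.
results = {p: re.sub(p, "+", plus_line) for p in patterns}
for pattern, replaced in results.items():
    print(f"{pattern!r} -> {replaced!r}")
```

In Python, at least, all four patterns match this line. The first three consume the whole line, while the last one matches only +SRR2422919.1 and leaves the rest of the line in place. If the same holds in Galaxy, the patterns themselves may not be the cause of the unchanged output.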
I would like help finding out what mistake I have made. Am I using an appropriate regular expression?