Question: What is the correct regular expression for replacing inconsistent sequence and quality identifiers?
14 months ago by
egonz34010 wrote:


I am trying to analyze RNA-seq data. The SRA accession number for the file that I am using is SRR2422919. I downloaded this file into Galaxy using the NCBI SRA tool "Download and Extract Reads in FASTA/Q format from NCBI SRA". I have checked NCBI, and this is a paired-end data set. However, the reads in this file are interleaved, and I have been looking for a way to separate them into two files. I intend to use Manipulate FASTQ for this because the reads do not appear to be in a format that would allow me to use FASTQ Splitter. I have included an example of what the FASTQ file looks like below.



First, I tried running Manipulate FASTQ with .+/2 to create a file for the forward reads. The job failed with an error related to how long the process had been running. Realizing that there must be a mistake in my procedure, I searched for how others handle separating interleaved FASTQ files. I am now trying to follow the instructions provided by the Galaxy Community Hub for the full process. I noticed that the identifier on the quality score line (the line that starts with +) did not meet either of the accepted criteria. I therefore tried the following regular expressions in the Replace Text in entire line tool: ^\+SRR.+, ^\+SRR242291.+, \+SRR.+, ^\+SRR2422919\.\d+. None of these regular expressions removed +SRR2422919.1. As far as I can tell, the input and output files are exactly the same.

I would like help finding out what mistake I have made. Am I using an appropriate regular expression?

14 months ago by
United States
Jennifer Hillman Jackson (25k) wrote:


Part of the regular expression is missing in your examples (a backslash to escape the plus sign).

Try this instead (same as in the hub help you linked): ^\+SRR.+
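To see what the pattern is expected to do outside Galaxy, here is a minimal Python sketch. It uses Python's `re` module as a stand-in for the Replace Text tool, which is an assumption: Galaxy's tool may use a different regex engine with slightly different rules. The FASTQ record is a made-up excerpt; only the ID `SRR2422919.1` comes from the question, and the function name `simplify_quality_headers` is hypothetical.

```python
import re

# Hypothetical one-record excerpt; only the ID SRR2422919.1 is from the thread.
fastq_text = (
    "@SRR2422919.1 1 length=8\n"
    "ACGTACGT\n"
    "+SRR2422919.1 1 length=8\n"
    "IIIIIIII\n"
)

# \+ matches a literal plus sign (an unescaped + is a quantifier).
# ^ anchors the match to the start of each line because of re.MULTILINE.
# .+ consumes the rest of that line only, since . does not match a newline.
QUALITY_HEADER = re.compile(r"^\+SRR.+", flags=re.MULTILINE)


def simplify_quality_headers(text: str) -> str:
    """Replace every '+SRR...' line with a bare '+'."""
    return QUALITY_HEADER.sub("+", text)


print(simplify_quality_headers(fastq_text))
```

If the same pattern leaves the file unchanged in Galaxy, the difference is more likely in how the tool receives the pattern (for example, how the backslash is entered or interpreted) than in the pattern itself.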

So, for interleaved data in this format, both steps need to be done as described here, in the recommended order:

  • Correct the quality score names
  • Separate the interleaved data
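The two steps above can be sketched in Python for anyone who wants to check the logic locally. This is not the Galaxy workflow itself, only an illustration under stated assumptions: every record is exactly four lines, and records strictly alternate forward, reverse, forward, reverse. The helper names (`read_records`, `fix_quality_header`, `deinterleave`) and the example reads are hypothetical.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # (header, sequence, plus line, quality)


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; assumes no wrapped sequence lines."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(chunk)


def fix_quality_header(record: Record) -> Record:
    """Step 1: reduce a '+SRR...' quality header to a bare '+'."""
    header, seq, plus, qual = record
    return (header, seq, re.sub(r"^\+SRR.+", "+", plus), qual)


def deinterleave(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Step 2: split alternating records into forward and reverse lists."""
    forward: List[Record] = []
    reverse: List[Record] = []
    for index, record in enumerate(read_records(lines)):
        target = forward if index % 2 == 0 else reverse
        target.append(fix_quality_header(record))
    if len(forward) != len(reverse):
        raise ValueError("odd number of records; data may not be interleaved")
    return forward, reverse


# Hypothetical interleaved pair; only the ID SRR2422919.1 is from the thread.
example = [
    "@SRR2422919.1 1 length=8\n", "ACGTACGT\n",
    "+SRR2422919.1 1 length=8\n", "IIIIIIII\n",
    "@SRR2422919.1 1 length=8\n", "TTTTGGGG\n",
    "+SRR2422919.1 1 length=8\n", "HHHHHHHH\n",
]
fwd, rev = deinterleave(example)
print(fwd)
print(rev)
```

Applying the substitution only to the third line of each record, rather than to every line, avoids touching a quality string that happens to begin with `+SRR`.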

Thanks! Jen, Galaxy team

I have tried this regular expression. I meant to include it in my question, but it seems the backslash was dropped when I made my initial post. I have edited the post above so that you can now see the regular expressions that I attempted to use. I did not try to use all of them at once; I ran the Replace Text tool separately for each one. I am now trying to follow the exact steps that were mentioned. However, as you can see, I am having trouble getting past the first step.

Respectfully, Galaxy user

