What is the correct regular expression for replacing inconsistent sequence and quality identifiers?


14 months ago by

egonz340 • 10 wrote:

Hello,

I am trying to analyze RNA-Seq data. The SRA accession number for the file I am using is SRR2422919. I downloaded it into Galaxy using the NCBI SRA tool "Download and Extract Reads in FASTA/Q format from NCBI SRA". I have checked NCBI, and this is a paired-end data set; however, the reads in this file are interleaved. I have been looking for a way to separate the reads into two files. I intend to use Manipulate FASTQ for this, because the reads do not appear to be in a format that would allow me to use FASTQ Splitter. An example of what the FASTQ file looks like is included below.

@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/1
CCAGCCGCAAAACCACTTCCTAGCAAATCCGTGCGCAAGGAGTCAAAAGAAGAAACCCCTGAGGTCACAAAAGTGAATCACGTGGAAAAGCCACCCAAAGTTGAAAGCAAAGAAAAGGTAATGGT
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
ABBBBGE>EGCGEGGGGGGGGGGGGGGGFGGGGEGGGGGF1FB@EFGGFGGGGFGGGGGGFFF==EEGGGGGGGGGGGGGEE>FC<fgggfgggf<fg0<<0<=;fggggggggcgc>GGDGGED

@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/2
TCACCGTCTTCTCCTTGGCAGCTTTGGGTTTGACATCTGTGGCTTGCTTCTCAGCCACCTCGGCTTTCACTGGAGATGGCTCTTCTTTGCTGGGAACCTCCTTTTGAGTCACTGAAGGTTTGGTC
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
B@BBCGGGGGGGGGGGGC>EFGGGGGGGCFGGGGCGGGGGGGGGGGGGGGGDFGGBGGGGGGFEEGF>E1EEFFG>GGGEEFGGGFGGGCGE>GB<bff@gggg09.<c<dccg;f<f0<@gg@0<

First, I tried running Manipulate FASTQ with .+/2 to create a file for the forward reads. The job ended with an error related to how long the process had run. Realizing that there must be a mistake in my procedure, I searched for how others handle separating interleaved FASTQ files. I am now trying to follow the instructions provided by the Galaxy Community Hub for the full process. I noticed that the sequence identifier on the quality score line did not meet either of the accepted criteria. I therefore tried each of the following regular expressions in the Replace Text in entire line tool: ^\+SRR.+, ^\+SRR242291.+, \+SRR.+, ^\+SRR2422919\.\d+. None of these regular expressions removed +SRR2422919.1. As far as I can tell, the files before and after the replacement are exactly the same.

I would like help finding out what mistake I have made. Am I using an appropriate regular expression?

fastq identifiers interleaved ncbi • 481 views


modified 14 months ago • written 14 months ago by egonz340 • 10

14 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Part of the regular expression is missing in your examples (a backslash to escape the plus sign).

Try this instead (same as in the hub help you linked): ^\+SRR.+
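As a rough illustration of what this pattern does, here is a minimal sketch using Python's `re` module as a stand-in. It is not the engine that the Galaxy Replace Text tool uses, so escaping behavior there may differ. The helper name `simplify_quality_header` and the shortened record are hypothetical and only for illustration.

```python
import re

# Pattern from the answer above: a line starting with a literal "+" followed by "SRR".
QUALITY_HEADER = re.compile(r"^\+SRR.+")

def simplify_quality_header(line: str) -> str:
    """Replace a '+SRR...' quality header with a bare '+'; leave other lines unchanged."""
    return QUALITY_HEADER.sub("+", line)

# Hypothetical record with shortened sequence and quality strings.
record = [
    "@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/1",
    "CCAGCCGC",
    "+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125",
    "ABBBBGE>",
]
cleaned = [simplify_quality_header(line) for line in record]
print("\n".join(cleaned))

# Without the backslash, "+" is read as a quantifier. In Python's re module
# it has nothing to repeat after "^", so the pattern is rejected outright.
unescaped_is_valid = True
try:
    re.compile(r"^+SRR.+")
except re.error:
    unescaped_is_valid = False
print("unescaped pattern accepted:", unescaped_is_valid)
```

Only the third line of the record changes. The `@` header, the sequence, and the quality string pass through untouched because none of them start with `+SRR`.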

So for interleaved data of this format both steps need to be done as described here, in the recommended order: https://galaxyproject.org/support/ncbi-sra-fastq/#ncbi-sra-sourced-fastq-data

1. Correct the quality score names
2. Separate the interleaved data
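For readers working outside Galaxy, the second step can be sketched in plain Python. This is only an assumption-laden illustration, not what the Galaxy tools do internally: it assumes every record is exactly four lines (no wrapped sequences) and that read names end in /1 or /2, as in the example in the question. The function names `fastq_records` and `split_interleaved` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = List[str]

def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into 4-line FASTQ records (assumes unwrapped sequences)."""
    block: Record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank separator lines
        block.append(line)
        if len(block) == 4:
            yield block
            block = []

def split_interleaved(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Route each record to forward or reverse based on its /1 or /2 suffix."""
    forward: List[Record] = []
    reverse: List[Record] = []
    for rec in fastq_records(lines):
        name = rec[0].split()[0]  # read name up to the first whitespace
        if name.endswith("/1"):
            forward.append(rec)
        elif name.endswith("/2"):
            reverse.append(rec)
        else:
            raise ValueError(f"read name has no /1 or /2 suffix: {name}")
    return forward, reverse

# Hypothetical two-record input with the quality headers already reduced to "+".
interleaved = [
    "@read1/1", "ACGT", "+", "GGGG",
    "@read1/2", "TTGA", "+", "FFFF",
]
forward, reverse = split_interleaved(interleaved)
print(len(forward), len(reverse))
```

Raising an error on an unexpected read name is a deliberate choice here, so that a file in a different naming format fails loudly instead of silently producing unbalanced outputs.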

Thanks! Jen, Galaxy team

written 14 months ago by Jennifer Hillman Jackson ♦ 25k

Hello,

I did try this regular expression. That is what I meant to write, but it seems the backslash was dropped when I made my initial post. I have edited the post above so that you can now see the regular expressions I attempted to use. I did not try them all at once; I ran the Replace Text tool separately for each one. I am now trying to follow the exact steps that were mentioned; however, as you can see, I am having trouble getting past the first step.

Respectfully, Galaxy user

written 14 months ago by egonz340 • 10
