Question: What is the correct regular expression for replacing inconsistent sequence and quality identifiers?
egonz340 wrote:

Hello,

I am trying to analyze RNA-Seq data. The SRA accession number for the file that I am using is SRR2422919. I imported this file into Galaxy using the NCBI SRA tool "Download and Extract Reads in FASTA/Q format from NCBI SRA". I have checked NCBI, and this is a paired-end data set. However, the reads in this file are interleaved, and I have been looking for a way to separate them into two files. I intend to use Manipulate FASTQ for this because the reads do not appear to be in a format that would allow me to use FASTQ Splitter. I have included an example of what the FASTQ file looks like below.

@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/1
CCAGCCGCAAAACCACTTCCTAGCAAATCCGTGCGCAAGGAGTCAAAAGAAGAAACCCCTGAGGTCACAAAAGTGAATCACGTGGAAAAGCCACCCAAAGTTGAAAGCAAAGAAAAGGTAATGGT
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
ABBBBGE>EGCGEGGGGGGGGGGGGGGGFGGGGEGGGGGF1FB@EFGGFGGGGFGGGGGGFFF==EEGGGGGGGGGGGGGEE>FC<FGGGFGGGF<FG0<<0<=;FGGGGGGGGCGC>GGDGGED

@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/2
TCACCGTCTTCTCCTTGGCAGCTTTGGGTTTGACATCTGTGGCTTGCTTCTCAGCCACCTCGGCTTTCACTGGAGATGGCTCTTCTTTGCTGGGAACCTCCTTTTGAGTCACTGAAGGTTTGGTC
+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125
B@BBCGGGGGGGGGGGGC>EFGGGGGGGCFGGGGCGGGGGGGGGGGGGGGGDFGGBGGGGGGFEEGF>E1EEFFG>GGGEEFGGGFGGGCGE>GB<BFF@GGGG09.<C<DCCG;F<F0<@GG@0<

First, I tried running Manipulate FASTQ with .+/2 to create a file for the forward reads. The job failed with an error related to how long the process had been running. Realizing that there must be a mistake in my procedure, I searched for how others handle separating interleaved FASTQ files. I am now trying to follow the instructions provided by the Galaxy Community Hub for the full process. I noticed that the identifier on the quality score line (the line starting with +) does not meet either of the accepted criteria. I therefore tried each of the following regular expressions in the Replace Text in entire line tool: ^\+SRR.+, ^\+SRR242291.+, \+SRR.+, ^\+SRR2422919\.\d+. None of these regular expressions removed +SRR2422919.1. As far as I can tell, the output files are exactly the same as the input.

I would like help finding out what mistake I have made. Am I using an appropriate regular expression?

modified 6 months ago • written 6 months ago by egonz340
Jennifer Hillman Jackson wrote:

Hello,

Part of the regular expression is missing in your examples (a backslash to escape the plus sign).

Try this instead (same as in the hub help you linked): ^\+SRR.+
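
To check the pattern outside Galaxy, here is a minimal Python sketch. It assumes Python's re module, which is not necessarily the engine the Replace Text tool uses, and the helper name fix_quality_header is made up for this example. It applies the same pattern to the quality header from the question and replaces it with a bare +.

```python
import re

# Pattern from the answer: a line starting with a literal "+" followed by "SRR".
# The backslash makes "+" a literal plus sign instead of a quantifier.
QUALITY_HEADER = re.compile(r"^\+SRR.+")


def fix_quality_header(line: str) -> str:
    """Replace a '+SRR...' quality header with a bare '+'.

    Lines that do not match the pattern are returned unchanged.
    (Hypothetical helper, for illustration only.)
    """
    return QUALITY_HEADER.sub("+", line)


header = "+SRR2422919.1 HWI-D0101:255:C5LEGANXX:1:1101:1242:2221 length=125"
read_name = "@HWI-D0101:255:C5LEGANXX:1:1101:1242:2221/1"

print(fix_quality_header(header))     # the quality header collapses to "+"
print(fix_quality_header(read_name))  # the read name line is left as it is
```

If the pattern matches here but not in Galaxy, the cause is more likely in how the tool form or the input file is set up than in the expression itself.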

So for interleaved data of this format both steps need to be done as described here, in the recommended order: https://galaxyproject.org/support/ncbi-sra-fastq/#ncbi-sra-sourced-fastq-data

  • Correct the quality score names
  • Separate the interleaved data
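
For illustration only, the two steps above can be sketched in Python. This is not what the Galaxy tools do internally. It assumes 4-line FASTQ records and read names ending in /1 and /2, as in the example from the question. The function name split_interleaved and the short reads in the example are made up.

```python
from typing import Iterable, List, Tuple

Record = List[str]  # [header, sequence, plus line, quality]


def split_interleaved(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split interleaved 4-line FASTQ records into forward and reverse lists.

    Step 1: the '+...' quality header of every record is reduced to a bare '+'.
    Step 2: records are routed by the '/1' or '/2' suffix of the read name.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty lines")

    forward: List[Record] = []
    reverse: List[Record] = []
    for i in range(0, len(rows), 4):
        header, seq, _plus, qual = rows[i:i + 4]
        record = [header, seq, "+", qual]  # step 1: normalize the plus line
        if header.endswith("/1"):          # step 2: route by mate suffix
            forward.append(record)
        elif header.endswith("/2"):
            reverse.append(record)
        else:
            raise ValueError(f"read name without /1 or /2 suffix: {header}")
    return forward, reverse


# Made-up toy reads, not taken from SRR2422919.
example = [
    "@read1/1", "ACGT", "+SRR0000000.1 read1 length=4", "GGGG",
    "@read1/2", "TTGA", "+SRR0000000.1 read1 length=4", "FFFF",
]
forward, reverse = split_interleaved(example)
print(forward)
print(reverse)
```

The sketch holds all records in memory, which is fine for a small test file but not for a full sequencing run.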

Thanks! Jen, Galaxy team

written 6 months ago by Jennifer Hillman Jackson

Hello,

I have tried this regular expression. It is what I meant to write, but it seems the backslash was dropped when I made my initial post. I have edited the post above so that you can now see the regular expressions that I attempted to use. I did not try to use all of them at once; I ran the Replace Text tool separately in each case. I am now trying to follow the exact steps that were mentioned. However, as you can see, I am having trouble getting past the first step.

Respectfully, Galaxy user

written 6 months ago by egonz340