Question: FASTQ to fastqsanger using FASTQ Groomer
David K Crossman • 130 wrote:
Hello!
I am fairly new to using Galaxy and have a question
about the FASTQ Groomer feature. I have 4 RNA-Seq raw data files that
were just recently generated from Illumina's NGS instruments. I am
aware that the first step to perform in Galaxy is FASTQ Groomer to
convert the format to FASTQ Sanger. I presume that I would choose
Illumina 1.3+ in the "Input FASTQ quality scores type" box. However,
if I look at the raw data reads, I notice that Line 4 (which encodes
the quality values for the sequence in Line 2) has values outside of the
Illumina 1.3+ range; some of them fall into the Sanger range. I am
enclosing the Quality Score Comparison figure along with some of the
raw RNA-Seq data:
Quality Score Comparison
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger        Phred+33,  93 values (0, 93)  (0 to 60 expected in raw reads)
I - Illumina 1.3  Phred+64,  62 values (0, 62)  (0 to 40 expected in raw reads)
X - Solexa        Solexa+64, 67 values (-5, 62) (-5 to 40 expected in raw reads)
Diagram adapted from http://en.wikipedia.org/wiki/FASTQ_format
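As a rough illustration of the diagram above, the following Python sketch checks which of the three encodings a quality string could belong to, based only on the ASCII ranges in the legend. The names `ENCODING_RANGES` and `compatible_encodings` are made up for this example; they are not part of Galaxy or FASTQ Groomer, and a non-empty quality string is assumed.

```python
# ASCII ranges taken from the legend above (lowest and highest valid character code).
ENCODING_RANGES = {
    "sanger": (33, 126),       # Phred+33
    "illumina1.3": (64, 126),  # Phred+64
    "solexa": (59, 126),       # Solexa+64
}


def compatible_encodings(quality):
    """Return the encodings whose ASCII range covers every character in `quality`."""
    codes = [ord(ch) for ch in quality]
    lo, hi = min(codes), max(codes)
    return [name for name, (start, end) in ENCODING_RANGES.items()
            if start <= lo and hi <= end]


# Quality line of the last read in the raw data; it contains '#' (ASCII 35).
print(compatible_encodings("FFCEFFFE?FEBDC?987::,3:<-9145,DA<:C9;+?############"))
```

A string that passes this check for several encodings is still ambiguous, so a range check like this can only rule encodings out, not confirm one.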
RNA-Seq raw data
@HWI-ST156_294:7:1:1058:2165:0/1
CACCAACTCACAGCCACTCCGTGAGGCCAGCAAGGCAAGAACATTCATCTC
+
HHHHFGGHHHGFHHFHHEGHC<GGGEB.EE9D?DDEEEE4FFFCBB/.C=D
@HWI-ST156_294:7:1:1184:2191:0/1
CGTAAATCCATGTCTGACTTCTGGATAGCAAACACCAGCACCGCGTGGATG
+
EE;E=ECEEBE@EEEE=GBFGF/GFFC<FA;:@<8AEABB>A#########
@HWI-ST156_294:7:1:1018:2200:0/1
NCTGATTAAGGATAATGAGTTTTTAGTAGAACTAATGATGTTATTCCTTGG
+
###################################################
@HWI-ST156_294:7:1:1225:2217:0/1
GTTTTTGACTACACAAAGCACCCTTCTAAACCAGACCATTCTGGAGAATGA
+
FFCEFFFE?FEBDC?987::,3:<-9145,DA<:C9;+?############
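To see the range of quality characters in a file without running the Groomer, one could scan every fourth line of the FASTQ. This is a minimal sketch, assuming plain four-line records (no wrapped sequences); `quality_ascii_range` is a hypothetical helper, not a Galaxy function.

```python
def quality_ascii_range(lines):
    """Return (min_char, max_char) over the quality lines of four-line FASTQ records."""
    lo, hi = None, None
    for i, line in enumerate(lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for ch in line.rstrip("\n"):
            lo = ch if lo is None or ch < lo else lo
            hi = ch if hi is None or ch > hi else hi
    return lo, hi


# The last two reads shown above.
records = [
    "@HWI-ST156_294:7:1:1018:2200:0/1",
    "NCTGATTAAGGATAATGAGTTTTTAGTAGAACTAATGATGTTATTCCTTGG",
    "+",
    "###################################################",
    "@HWI-ST156_294:7:1:1225:2217:0/1",
    "GTTTTTGACTACACAAAGCACCCTTCTAAACCAGACCATTCTGGAGAATGA",
    "+",
    "FFCEFFFE?FEBDC?987::,3:<-9145,DA<:C9;+?############",
]
lo, hi = quality_ascii_range(records)
print(f"{lo!r}({ord(lo)}) - {hi!r}({ord(hi)})")
```

The same function accepts an open file handle in place of the list, since it only iterates over lines.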
As a test in FASTQ Groomer, I tried both Sanger and Illumina 1.3+
as the input quality scores type, and these are the
results I got:
FASTQ Groomer on tn-read1 (using Sanger as input)
6.1 Gb
format: fastqsanger, database: mm9
Info: Groomed 45868679 sanger reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: 2 - 40
FASTQ Groomer on tn-read1 (using Illumina 1.3+ as input)
6.1 Gb
format: fastqsanger, database: mm9
Info: Groomed 45868679 illumina reads into sanger reads.
Based upon quality and sequence, the input data is valid for: sanger
Input ASCII range: '#'(35) - 'I'(73)
Input decimal range: -29 - 9
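The two "Input decimal range" values above are the same ASCII range read with different offsets. A short sketch of that arithmetic (the function name `decimal_range` is just for illustration):

```python
def decimal_range(ascii_lo, ascii_hi, offset):
    """Convert an ASCII code range to quality scores by subtracting the encoding offset."""
    return ascii_lo - offset, ascii_hi - offset


ascii_lo, ascii_hi = ord("#"), ord("I")  # 35 and 73, as reported by the Groomer

print(decimal_range(ascii_lo, ascii_hi, 33))  # Sanger, Phred+33        -> (2, 40)
print(decimal_range(ascii_lo, ascii_hi, 64))  # Illumina 1.3+, Phred+64 -> (-29, 9)
```

Read as Phred+64, the scores would run from -29 to 9, which lies outside the 0 to 62 range listed for Illumina 1.3 in the diagram; read as Phred+33, they run from 2 to 40.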
Which one is right (I presume the Illumina 1.3+ one, but I can't find
any sort of explanation)? I noticed that the "Input decimal range"
had different values (although they span the same width) depending on
which input type was chosen. What would happen downstream in
TopHat if Sanger was used instead of Illumina 1.3+ for these files?
Are there any other reading materials or websites that
might help me better understand the quality scores and which type to use?
Any info/help would be greatly appreciated.
Thanks,
David
David K. Crossman, Ph.D.
Systems Biologist/Analyst/Statistician
Heflin Center for Genomic Science
University of Alabama at Birmingham
720 20th Street South
Kaul Room 420
Birmingham, AL 35294-0024
(205) 996-4045
(205) 996-4056 (fax)
David K. Crossman, Ph.D. <dkcrossm@uab.edu>
Heflin Center for Genomic Science <http://www.heflingenetics.uab.edu/>
modified 7.7 years ago by Peter Cock • 1.4k • written 7.7 years ago by David K Crossman • 130