Dear Biostars,
I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data using Galaxy, and I get a very low mapping percentage (0.5-1.7 %) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90 % mapping).
I import the GEO datasets into Galaxy using NCBI SRA Tools. When I look at the read quality strings of this dataset, they look different from those of the other GEO uploads. I am aware that there are 3 types of FASTQ format and that this matters for downstream applications.
P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so), and I also tried FASTQ Groomer to convert from an assumed Solexa format (+64) to Sanger format (+33), but mapping was still low. I also tried BWA, BWA Illumina, BWA-MEM, and Bowtie2.
The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.
Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?
If you need more info (stats, SRR number, etc.), please say so.
Low % mapping FASTQ file (2 reads):
@SOLEXAWS1_0000:1:1:5020:1033/1
GAAAACCTTTCTCCATGACTAGTTTGAAGCTACAA
+
2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@
@SOLEXAWS1_0000:1:1:9922:1034/1
AATTCTATAGAGTTTATTTAATGTTTAAATGATTT
+
(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@
Other GEO dataset with good % mapping (2 reads):
@D256N5M1:21:C00B5ABXX:5:1101:1108:2087 1:N:0:TGACCA
CCGCAATAGCGTCTGGTGCGGCGCCTTCTTGCCGGAGCAAAT
+
CCCFFFFFHHHHHJJJEHIJJJJJJJJJIJIJJJJBHFBEFF
@D256N5M1:21:C00B5ABXX:5:1101:1059:2089 1:N:0:TGACCA
TGGTGATGTCCTCGCTCCAGTTCCCTGGGCACGCAGTGGAAG
+
@@CBDDEFHHHHHJJJIHJIDDGGJJJJIIFJJEFHDGGFGE
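To make the encoding question concrete, here is a minimal Python sketch of how the quality offset could be guessed from the range of characters in the quality lines. The function name `guess_quality_offset` and the cut-offs are my own assumptions, not something taken from FastQC or FASTQ Groomer: characters below ';' (ASCII 59) are treated as +33 only, and characters above 'J' (ASCII 74) as +64, on the assumption that +33 Illumina scores are capped at Q41. With only a couple of reads the result may be undecided.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns None when the observed character range fits both encodings.
    Heuristic only; the cut-offs are assumptions stated in the comments.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur in a +64 encoding.
        return 33
    if highest > 74:
        # Characters above 'J' (ASCII 74) would exceed Q41 under +33,
        # so +64 is the more likely encoding (assumes Illumina-style caps).
        return 64
    # The range fits both encodings; more reads are needed to decide.
    return None


# Quality lines of the two low % mapping reads shown above.
low_mapping = [
    "2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@",
    "(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@",
]
print(guess_quality_offset(low_mapping))  # prints 33
```

For the two low % mapping reads this gives 33, because they contain characters such as '&' and '.', which agrees with the Illumina 1.9 call from FastQC. For the two good-mapping reads alone it returns None, since their characters ('@' to 'J') fit either offset, so a larger sample of reads would be needed there.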
Thanks
Human