I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data using Galaxy, and I get a very low mapping percentage (0.5-1.7 %) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90 % mapping).
I import the GEO datasets into Galaxy using NCBI SRA Tools. When I look at the read quality strings, they look different from those of the other GEO uploads. I am aware that there are 3 types of FASTQ quality encoding and that this matters for downstream applications.
P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so), and I also tried FASTQ Groomer to convert from an assumed Solexa format (+64) to Sanger format (+33), but mapping was still low. I also tried BWA, BWA Illumina, BWA-MEM, and Bowtie2.
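For reference, this is roughly how the quality encoding could be double-checked outside Galaxy. It is only a minimal sketch: the function names are my own, it assumes plain 4-line FASTQ records (no wrapped sequence lines), and the thresholds are the commonly quoted ASCII ranges for the Sanger/Illumina 1.8+ (+33), Solexa (+64) and Illumina 1.3-1.7 (+64) encodings.

```python
from itertools import islice


def quality_strings(fastq_lines, max_reads=10000):
    """Yield the quality line (4th line) of each 4-line FASTQ record."""
    quals = islice(fastq_lines, 3, None, 4)
    for line in islice(quals, max_reads):
        yield line.rstrip("\n")


def guess_quality_encoding(quality_lines):
    """Guess the encoding from the lowest and highest ASCII codes seen."""
    codes = [ord(c) for line in quality_lines for c in line]
    if not codes:
        return "ambiguous"
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' only occur with an offset of 33.
        return "phred+33"
    if lowest < 64:
        # ';' to '?' are Solexa scores of -5 to -1 with an offset of 64.
        return "solexa+64"
    if highest > 74:
        # Nothing below '@' but values above 'J': an offset of 64 is likely.
        return "phred+64"
    # The observed range fits both offsets; more reads are needed.
    return "ambiguous"


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python guess_encoding.py reads.fastq
    with open(sys.argv[1]) as handle:
        print(guess_quality_encoding(quality_strings(handle)))
```

If this reports phred+33, running the groomer with a +64 input assumption would shift the qualities wrongly rather than fix anything.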
The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.
Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?
If you need more info (stats, SRR numbers, etc.), please say so.
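To get a feel for how much pre-processing might matter, the fraction of reads containing the start of the adapter could be estimated along these lines. Again just a sketch under assumptions: the function names and the `min_overlap` parameter are made up for illustration, the adapter sequence has to be supplied (for example the one FastQC reports as overrepresented), and only exact matches of the adapter's first bases are counted, so adapters cut off near the read end or containing sequencing errors are missed.

```python
from itertools import islice


def read_sequences(fastq_lines, max_reads=100000):
    """Yield the sequence line (2nd line) of each 4-line FASTQ record."""
    seqs = islice(fastq_lines, 1, None, 4)
    for line in islice(seqs, max_reads):
        yield line.strip().upper()


def adapter_fraction(sequences, adapter, min_overlap=10):
    """Return the fraction of reads containing the first min_overlap adapter bases."""
    seed = adapter[:min_overlap].upper()
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if seed in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

With 35 bp reads, a large fraction here would mean that many reads are mostly adapter, which would be one possible explanation for reads failing to map.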
Low % mapping FASTQ file (2 reads):
Other GEO dataset good % mapping: