Question: GEO SRA fastq-dump with very low mapping rate (Galaxy)
3 months ago by
human10 wrote:

Dear Biostars,

I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data in Galaxy, and I get a very low mapping percentage (0.5-1.7 %) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90 % mapping).

I import the GEO datasets into Galaxy using NCBI SRA Tools. For this dataset, the read quality strings look different from those of the other GEO uploads. I am aware that there are 3 types of FASTQ quality format and that this matters for downstream applications.
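One way to check the quality format directly is to look at the ASCII range of the quality characters. The sketch below is only a heuristic under the usual assumptions (Phred+33 strings contain characters below ASCII 59, Solexa/Illumina +64 strings do not and typically reach above ASCII 74); the function names `quality_lines` and `guess_phred_offset` are made up for this illustration and are not part of Galaxy or SRA Tools.

```python
from typing import Iterable, Iterator


def quality_lines(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the 4th line of every 4-line FASTQ record (the quality string)."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:
            yield line


def guess_phred_offset(qualities: Iterable[str]) -> str:
    """Guess the quality offset from the observed ASCII range (heuristic)."""
    lowest, highest = 255, 0
    for line in qualities:
        line = line.strip()
        if not line:
            continue
        codes = [ord(c) for c in line]
        lowest = min(lowest, min(codes))
        highest = max(highest, max(codes))
    if highest < lowest:
        return "unknown"      # no quality characters were seen
    if lowest < 59:
        return "phred+33"     # characters this low cannot occur with a +64 offset
    if highest > 74:
        return "phred+64"     # above the usual Phred+33 maximum for Illumina data
    return "ambiguous"        # range fits both offsets; inspect more reads


if __name__ == "__main__":
    # Two made-up records, used only to show the call pattern.
    records = ["@r1", "ACGT", "+", "II#I", "@r2", "ACGT", "+", "IIII"]
    print(guess_phred_offset(quality_lines(records)))
```

An "ambiguous" result usually just means too few reads were inspected; running it over a few thousand records tends to settle the question.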

P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so), and I also tried FASTQ Groomer to convert an assumed Solexa format (+64) to Sanger format (+33), but mapping was still low. I also tried BWA, BWA Illumina, BWA-MEM and Bowtie2.
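To judge whether untrimmed adapter could explain a low mapping rate, it can help to count how many reads actually contain the adapter. This is a rough sketch, not a replacement for a proper trimming tool: `adapter_fraction` is a hypothetical helper, it only looks for the first `min_overlap` bases of the adapter anywhere in the read, and the adapter string in the usage lines is a placeholder to be replaced with the sequence FastQC reports.

```python
from typing import Iterable


def adapter_fraction(sequences: Iterable[str], adapter: str, min_overlap: int = 10) -> float:
    """Return the fraction of reads containing the first `min_overlap` adapter bases."""
    seed = adapter[:min_overlap].upper()
    total = 0
    hits = 0
    for seq in sequences:
        seq = seq.strip().upper()
        if not seq:
            continue
        total += 1
        if seed in seq:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Placeholder adapter and made-up reads, for illustration only.
    adapter = "GATCGGAAGAGCTCGTATGC"
    reads = ["ACGTGATCGGAAGAGCTT", "ACGTACGTACGTACGT"]
    print(adapter_fraction(reads, adapter))
```

Adapter fragments shorter than `min_overlap` at the 3' end of a read are missed by this check, so the result is a lower bound. With 35 bp reads, a high fraction would suggest trimming before mapping; a low fraction points to a different cause.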

The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.

Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?

If you need more info (stats, SRR number, etc.), please say so.

Low % mapping fastq file (2 reads)









Other GEO dataset good % mapping:

@D256N5M1:21:C00B5ABXX:5:1101:1108:2087 1:N:0:TGACCA




@D256N5M1:21:C00B5ABXX:5:1101:1059:2089 1:N:0:TGACCA






27 days ago by
human10 wrote:

Hey Jennifer,

Just to give the solution to my previous problem: the uploaded FASTQ files were the wrong ones, not the files they were supposed to be, hence the very low mapping rate. We have asked GEO to correct this issue.




Great catch and follow-up! Thanks for posting the info back here and for contacting GEO to fix the problem at the source. That will benefit everyone :)

Reply written 27 days ago by Jennifer Hillman Jackson
3 months ago by
United States
Jennifer Hillman Jackson25k wrote:


Data extracted from this source, with this tool, has been re-scaled to Sanger Phred+33 format even if originally submitted in another format. This is the .fastqsanger datatype in Galaxy.

For RNA-seq reads, a splice-aware mapping tool would be a better choice. Try HISAT2.

Thanks! Jen, Galaxy team
