Question: GEO SRA fastq-dump with very low mapping rate (Galaxy)
3 months ago by
human10 wrote:

Dear Biostars,

I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data in Galaxy, and I get a very low mapping percentage (0.5-1.7%) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90% mapping).

I import the GEO datasets into Galaxy using NCBI SRA Tools. When you look at the reads, the quality formats look different from those of other GEO uploads. I am aware that there are 3 types of FASTQ format and that this matters for downstream applications.

P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so) and also tried FASTQ Groomer to convert from an assumed Solexa format (+64) to Sanger format (+33), but mapping was still low. I also tried BWA, BWA Illumina, BWA-MEM, and Bowtie2.

The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.

Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?

If you need more info (stats, SRR number, etc.), please say so.

Low % mapping fastq file (2 reads)









Other GEO dataset good % mapping:

@D256N5M1:21:C00B5ABXX:5:1101:1108:2087 1:N:0:TGACCA




@D256N5M1:21:C00B5ABXX:5:1101:1059:2089 1:N:0:TGACCA






rna-seq alignment galaxy • 134 views
modified 27 days ago • written 3 months ago by human10
27 days ago by
human10 wrote:

Hey Jennifer,

Just to give the solution to my previous problem: the uploaded FASTQ files were wrong and not the intended files, hence the very low mapping rate. We requested that GEO correct this issue.



written 27 days ago by human10

Great catch and follow-up! Thanks for posting the info back here and for contacting GEO to fix the problem at the source; that will benefit everyone :)

written 27 days ago by Jennifer Hillman Jackson25k
3 months ago by
Jennifer Hillman Jackson25k wrote:


Data extracted from this source, with this tool, has been re-scaled to Sanger Phred+33 format even if originally submitted in another format. This is the .fastqsanger datatype in Galaxy.
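
If you want to double-check the quality offset outside Galaxy, here is a minimal Python sketch. It assumes a plain 4-line-per-record FASTQ file; the function names and the example record are made up for illustration and are not part of any Galaxy tool. The idea is only that quality characters with an ASCII code below 59 (';') cannot occur in the +64 encodings, so seeing any of them points to Phred+33.

```python
def quality_lines(fastq_lines):
    """Yield the quality string (4th line) of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the lowest ASCII code seen.

    Heuristic only: codes below 59 are impossible in Solexa+64 / Phred+64,
    so they imply +33. If no such code is seen, +64 is merely the more
    likely reading; uniformly high-quality +33 data would look the same.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters given")
    return 33 if min(codes) < 59 else 64


if __name__ == "__main__":
    # Made-up record, for illustration only (not from the dataset in question).
    record = ["@read1\n", "ACGTACGT\n", "+\n", "IIII##II\n"]
    print(guess_phred_offset(quality_lines(record)))
```

On a real file you would pass an open file handle to `quality_lines` and probably stop after the first few thousand reads.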

For RNA-seq reads, a splice-aware mapping tool would be a better choice. Try HISAT2.
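
In Galaxy you would just pick HISAT2 from the tool panel. For anyone running it on the command line instead, a rough sketch of the equivalent single-end call could look like the following. The index prefix and file names are hypothetical placeholders, and the helper function is only an illustration; `-x`, `-U`, `-S`, `-p` and `--phred33` are standard HISAT2 options.

```python
import shlex


def hisat2_command(index_prefix, fastq_path, sam_path, threads=1):
    """Build the argument list for a single-end HISAT2 run.

    All paths are supplied by the caller; nothing is executed here.
    """
    return [
        "hisat2",
        "--phred33",          # qualities are Sanger Phred+33 (fastqsanger)
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # prefix of a prebuilt HISAT2 index
        "-U", fastq_path,     # unpaired (single-end) reads
        "-S", sam_path,       # SAM output file
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = hisat2_command("genome_index/genome", "reads.fastq", "aligned.sam", threads=4)
    print(shlex.join(cmd))
    # To actually run it (requires HISAT2 and a built index):
    # import subprocess; subprocess.run(cmd, check=True)
```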

Thanks! Jen, Galaxy team

written 3 months ago by Jennifer Hillman Jackson25k
