GEO SRA fastq-dump with very low mapping rate (Galaxy)


human • 10 wrote (9 months ago):

Dear Biostars,

I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data using Galaxy, and I get a very low mapping percentage (0.5-1.7 %) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90 % mapping).

I import the GEO datasets into Galaxy using NCBI SRA Tools. The read quality strings in this dataset look different from those in other GEO uploads. I am aware that there are 3 types of FASTQ quality format and that the format matters for downstream applications.

P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so), and I also tried FASTQ Groomer to convert from an assumed Solexa format (+64) to Sanger format (+33), but the mapping rate was still low. I also tried BWA, BWA Illumina, BWA-MEM, and Bowtie2.

The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.

Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?

If you need more info (stats, SRR number, etc.), please say so.

Low % mapping fastq file (2 reads)

@SOLEXAWS1_0000:1:1:5020:1033/1

GAAAACCTTTCTCCATGACTAGTTTGAAGCTACAA

2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@

@SOLEXAWS1_0000:1:1:9922:1034/1

AATTCTATAGAGTTTATTTAATGTTTAAATGATTT

(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@

Other GEO dataset good % mapping:

@D256N5M1:21:C00B5ABXX:5:1101:1108:2087 1:N:0:TGACCA

CCGCAATAGCGTCTGGTGCGGCGCCTTCTTGCCGGAGCAAAT

CCCFFFFFHHHHHJJJEHIJJJJJJJJJIJIJJJJBHFBEFF

@D256N5M1:21:C00B5ABXX:5:1101:1059:2089 1:N:0:TGACCA

TGGTGATGTCCTCGCTCCAGTTCCCTGGGCACGCAGTGGAAG

@@CBDDEFHHHHHJJJIHJIDDGGJJJJIIFJJEFHDGGFGE

Thanks

Human

rna-seq alignment galaxy • 386 views


modified 7 months ago • written 9 months ago by human • 10

human • 10 wrote (7 months ago):

Hey Jennifer,

Just to give the solution to my previous problem: the uploaded FASTQ files were wrong and not the intended files, hence the very low mapping rate. We requested that GEO correct this issue.

Best

Human


Great catch and follow-up! Thanks for posting the info back here and for contacting GEO to fix the problem at the source. That will benefit everyone :)

Reply written 7 months ago by Jennifer Hillman Jackson ♦ 25k

Jennifer Hillman Jackson ♦ 25k (United States) wrote (9 months ago):

Hello,

Data extracted from this source, with this tool, has been re-scaled to Sanger Phred+33 format even if originally submitted in another format. This is the .fastqsanger datatype in Galaxy.
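As a rough illustration of the encoding point, the offset can often be guessed from the range of ASCII characters in the quality lines. The sketch below is only a heuristic, not part of any Galaxy or SRA tool: the function name `guess_phred_offset` and the cut-offs (59 as the lowest Solexa+64 character, 74 as the usual top of Illumina Phred+33 data) are assumptions made for illustration.

```python
def guess_phred_offset(quality_strings):
    """Guess the FASTQ quality offset (33 or 64) from the ASCII range.

    Returns 33, 64, or None when the characters seen fit both encodings.
    Heuristic only: a handful of reads may not be enough to decide.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur in +64 encodings.
        return 33
    if highest > 74:
        # Above 'J' (ASCII 74) is unusual for Illumina Phred+33 data.
        return 64
    return None  # ambiguous: consistent with either offset


# The two quality lines from the low-mapping sample in the question.
low_mapping = [
    "2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@",
    "(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@",
]
print(guess_phred_offset(low_mapping))
```

For the two low-mapping reads, characters such as `&` and `(` fall below ASCII 59, so they are only consistent with Phred+33, which matches the re-scaling described above. The two reads shown for the well-mapping dataset span `@` to `J`, which this heuristic cannot decide on its own; more reads would be needed.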

For RNA-seq reads, a splice-aware mapping tool would be a better choice. Try HISAT2.

Thanks! Jen, Galaxy team

