Question: GEO SRA fastq-dump with very low mapping rate (Galaxy)
human0 wrote, 6 days ago:

Dear Biostars,

I am a fairly inexperienced biologist doing a meta-analysis of RNA-seq/microarray expression data using Galaxy, and I get a very low mapping percentage (0.5-1.7 %) with one particular GEO dataset. I have tested 3 other datasets, and mapping with BWA (same parameters) works like a charm (80-90 % mapping).

I import the GEO datasets into Galaxy using NCBI SRA Tools. When I look at the read quality strings, they look different from those of the other GEO uploads. I am aware that there are 3 types of FASTQ format and that this matters for downstream applications.

P.S. The data is not trimmed (Illumina adapter 1 is overrepresented). I assumed it is Illumina 1.9 encoded (FastQC says so), and I also tried FASTQ Groomer to convert it from an assumed Solexa format (+64) to Sanger format (+33), but mapping was still low. I also tried BWA, BWA Illumina, BWA-MEM and Bowtie2.

The low % mapping sample is single-end 35 bp, sequenced on an Illumina GAIIx; I have no idea whether it was demultiplexed or preprocessed. The high % mapping sample is single-end 35 bp, sequenced on a HiSeq 2000.

Is the file format wrong, or do the FASTQ files need more pre-processing before mapping?

If you need more info (stats, SRR number, etc.), please say so.

Low % mapping FASTQ file (2 reads):

@SOLEXAWS1_0000:1:1:5020:1033/1
GAAAACCTTTCTCCATGACTAGTTTGAAGCTACAA
+
2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@
@SOLEXAWS1_0000:1:1:9922:1034/1
AATTCTATAGAGTTTATTTAATGTTTAAATGATTT
+
(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@

Other GEO dataset with good % mapping:

@D256N5M1:21:C00B5ABXX:5:1101:1108:2087 1:N:0:TGACCA
CCGCAATAGCGTCTGGTGCGGCGCCTTCTTGCCGGAGCAAAT
+
CCCFFFFFHHHHHJJJEHIJJJJJJJJJIJIJJJJBHFBEFF
@D256N5M1:21:C00B5ABXX:5:1101:1059:2089 1:N:0:TGACCA
TGGTGATGTCCTCGCTCCAGTTCCCTGGGCACGCAGTGGAAG
+
@@CBDDEFHHHHHJJJIHJIDDGGJJJJIIFJJEFHDGGFGE

Thanks

Human

Jennifer Hillman Jackson wrote, 5 days ago:

Hello,

Data extracted from this source, with this tool, has been re-scaled to Sanger Phred+33 format even if originally submitted in another format. This is the .fastqsanger datatype in Galaxy.
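
The encoding can also be sanity-checked by hand from the quality characters themselves. Below is a minimal Python sketch of that idea; it is not part of Galaxy or the SRA tools, and the function name `guess_phred_offset` is made up for illustration. It relies on two assumptions: characters below ASCII 59 (`;`) cannot occur in any +64 encoding, and characters above ASCII 74 (`J`) are unusual in Illumina +33 data. With only a few reads the result can be inconclusive.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from FASTQ quality strings.

    Heuristic only (an assumption, not an official rule):
    - any character below ASCII 59 (';') rules out the +64 encodings,
      since Solexa+64 starts at 59 and Phred+64 at 64;
    - any character above ASCII 74 ('J') suggests +64, because Illumina
      +33 data rarely exceeds Q41.
    Returns None when neither condition is met.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if highest > 74:
        return 64
    return None  # range fits both encodings; look at more reads


# Quality lines of the two low-mapping reads quoted in the question.
low_mapping_quals = [
    "2../10222@@@@@@@@@C@@@@@@@@@@@@@@C@",
    "(&(,)()+(@@@@@@@@222::::@@@@@@@@@@@",
]
print(guess_phred_offset(low_mapping_quals))
```

Characters such as `.`, `(` and `&` in the posted reads fall below ASCII 59, which is consistent with the data already being Phred+33.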

For RNA-seq reads, a splice-aware mapping tool would be a better choice. Try HISAT2.

Thanks! Jen, Galaxy team
