Mark Duplicates - MAPQ = 0 and Value was put into PairInfoMap more than once

Dear Biostars Galaxy team,

I'm relatively new to the Galaxy online platform and have been using it to run RNA-seq on paired-end RNA data from an Illumina run, with the rat rn5 genome as the reference. Having completed our RNA-seq analysis, we are now trying to look for SNPs/variants within the reads, but I am having trouble pre-processing the files with Picard tools: the Mark Duplicate Reads tool throws a couple of errors that halt progress. These are the steps that I have taken so far:

Raw Illumina Fastq files ftp'd to usegalaxy public instance (FOR and REV for 2 lanes)
FASTQ Groomer - convert to fastqsanger
Trim by FASTQ quality score >=20
Map with BWA for Illumina using rn5 and paired end reads
Convert SAM to BAM for both mapped lane files
Reorder BAM for both files
Add read groups for both files
Mark Duplicate reads - removing duplicates from output -> here is where we get the issue.

The two errors thrown are "MAPQ should be 0 for unmapped read." and "Value was put into PairInfoMap more than once". They halt this pre-processing step, so the BAM files cannot move on to GATK variant analysis.
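For readers hitting the first of these errors, the condition it names can be checked directly on the SAM text. The following is a minimal sketch, not part of Galaxy or Picard: the function name `unmapped_with_nonzero_mapq` and the example records are hypothetical. It relies only on the SAM format itself, where column 2 is the FLAG (bit 0x4 means the read is unmapped) and column 5 is the MAPQ.

```python
def unmapped_with_nonzero_mapq(sam_lines):
    """Yield (read_name, flag, mapq) for unmapped reads whose MAPQ is not 0.

    Hypothetical helper for illustration; it expects SAM text lines.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        flag = int(fields[1])
        mapq = int(fields[4])
        # Bit 0x4 of the FLAG marks the read itself as unmapped.
        if flag & 0x4 and mapq != 0:
            yield fields[0], flag, mapq


# Made-up records: read2 is flagged unmapped but still carries MAPQ 29.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\tACGT\tIIII",
    "read2\t69\tchr1\t100\t29\t*\t=\t100\t0\tACGT\tIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
offenders = list(unmapped_with_nonzero_mapq(example))
```

A check like this only locates the offending records; it does not say which upstream step produced them.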

In addition, for running the GATK analysis with a custom genome, is the best practice simply to FTP the UCSC rn5.fa file into a history and use that, or should there be an additional index file as well?

Any help with regards to these issues would be greatly appreciated and please let me know if I need to clarify anything for a solution to be found!

Many thanks,

Christian

bwa snp rna paired end mark duplicate reads • 2.0k views

modified 3.8 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.8 years ago by christianwood7311 • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Is your data for this run RNA or DNA? It sounds like you are using the same data for both analysis paths, but maybe that is an incorrect assumption. Using BWA for RNA data will be problematic, since BWA is not a splice-aware aligner.

The UCSC genome rn5 is on our list to index natively for the mapping tools; in the meantime, use it as a Custom Reference Genome.

You will need it as a Custom genome for use with GATK tools anyway. Format the genome first, before using it for any mapping. Order the chromosomes in a "GATK" sort order: "chr1, chr2, chr3, ..., chrX, chrY, chrM" (if you include the unmapped/haplo contigs, add them to the end, after the full chromosomes, sorted alphabetically). There is no tool that does this sorting in a single step. Instead, you can do it on the command line before upload, or within Galaxy use the tools in Text Manipulation to break up the file (converted to tabular), sort the autosomes/others, then "Concatenate" the results in the correct order, ending with a conversion back to FASTA format and line wrapping.
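For the command-line route, the reordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a Galaxy tool: the function names (`gatk_sort_key`, `read_fasta`, `write_fasta`, `reorder_fasta`) and the 60-character line width are hypothetical choices, and the sketch holds every sequence in memory, which may be impractical for a full genome on a small machine.

```python
import re


def gatk_sort_key(name):
    """Sort key: numbered chromosomes, then chrX, chrY, chrM, then the rest."""
    match = re.fullmatch(r"chr(\d+)", name)
    if match:
        return (0, int(match.group(1)), "")  # chr1, chr2, ..., numerically
    special = {"chrX": 1, "chrY": 2, "chrM": 3}
    if name in special:
        return (special[name], 0, "")
    return (4, 0, name)  # remaining contigs last, alphabetically


def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records, name, seq = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(seq)))
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if name is not None:
        records.append((name, "".join(seq)))
    return records


def write_fasta(records, width=60):
    """Return FASTA lines with sequences wrapped at `width` characters."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


def reorder_fasta(lines, width=60):
    """Reorder FASTA records into the chr1..chrN, chrX, chrY, chrM order."""
    records = sorted(read_fasta(lines), key=lambda rec: gatk_sort_key(rec[0]))
    return write_fasta(records, width)
```

In use, one would read the lines of the downloaded FASTA file, pass them to `reorder_fasta`, and write the returned lines to a new file before uploading it to Galaxy.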

If after this you are still having problems, and can duplicate the problem on the public Main Galaxy instance at http://usegalaxy.org, send in one of the error datasets as a bug report and we can take a closer look. Please include a link to this Biostar thread in the comments, and make certain all datasets in the analysis thread are undeleted.

Thanks, Jen, Galaxy team

Hi Jennifer,

Thank you for your reply on this matter, and for bearing with the lack of knowledge shown in my initial question and in this reply. I am running paired-end RNA data through the Galaxy platform, looking for SNPs and indels in two different strains of rat, and am trying to figure out the most appropriate method to use for the variant analysis. With regard to the mapping, you suggest not using BWA; would Bowtie or TopHat be preferable in this instance?

With regard to the custom reference genome preparation, I have converted the file to tabular but am unsure how to carry out the steps you have suggested. Using the 'Compute sequence length' tool, I can see that the genome is not currently in the correct order, with the list going chr1, chr10, chr11 and so on. Is there a tutorial available that shows how to do these manipulations correctly? I'm unsure how to break the large tabular file into smaller parts that I can then sort and subsequently combine using the 'Concatenate' tool.

Any help with regards to this would be greatly appreciated.

Many thanks,

Christian

written 3.8 years ago by christianwood7311 • 0
