Too few reads after VarScan on RNA-Seq data?


Hi all,

I have been trying to set up a protocol to call SNPs in RNA-Seq data, but have run into a few problems. The main steps I have carried out are listed below:

Import known SNPs from hg19 and RNA-Seq data

Convert to FASTQ

FastQC on RNA-Seq data

Convert to FastQSanger

Split paired end reads into forward and reverse reads

CollectInsertSizeMetrics (Mean insert size = 149.53 SD = 37.691812)

FASTQ to FASTA on forward and reverse reads

Compute sequence length on forward and reverse reads

Summary Statistics on forward and reverse reads (Mean = 37.6992, SD = 1.79589)

TopHat on forward and reverse reads (Mean inner distance = 300-(38+38) SD = 38)

MPileup on TopHat data

VarScan on MPileup data

Sort into chromosomal order

Filter for all mutations from an 'A' to 'G'

However, when I run VarScan it returns only 39 reads, far fewer than I would expect from a 4.5 GB file of over 128 million reads. Can anybody see whether I have gone obviously wrong somewhere in my protocol?

Many thanks,

Frankie.

rna-seq tophat snp varscan mpileup • 796 views


modified 2.1 years ago by Jennifer Hillman Jackson ♦ 25k • written 2.1 years ago by frankie.north • 10

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The protocol looks good. I'll suggest a few ways to find out what is causing this (there are certainly more/others):

Examine how well the mapping step with TopHat went: you may have lost reads there. Maximising the number of concordantly mapped read pairs is the goal.
The reads are short. In TopHat's Full Parameters, check "Minimum length of read segments". This should be set to one-half of the shortest (or mean, your call) sequence length. Any read shorter than twice the segment length will map with bias or not map at all.
Run FastQC on the actual input FASTQ data (it was run earlier, but only to see whether the groomer was needed). Clues about data quality may point to mapping issues. The data is a bit short to trim: reads with problems will likely just not map, and getting them to map after clipping will be a challenge.
Perhaps experiment with the values for insert size. What is expected from the lab is not always what is actually in the data!
Examine MACS as well. See the manual for the settings.
Run VarScan on the outputs of a few upstream jobs with modified inputs/parameters to find the settings that give the best results. None of this rules out having to go back to the lab if you suspect core data usability issues: sample mix-ups plus everything else that can go wrong!
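As a rough sketch of the arithmetic behind the segment-length and insert-size suggestions above, using the numbers reported in the question. The function names are made up for illustration, and the second calculation assumes that the CollectInsertSizeMetrics mean spans the whole fragment, including both reads; if that assumption does not hold for your data, the result will differ.

```python
def suggested_segment_length(read_length: float) -> int:
    """Rule of thumb from the answer: half the read length, rounded down."""
    return int(read_length // 2)


def mate_inner_distance(fragment_size: float, read_length: float) -> int:
    """Gap between the two mates: fragment size minus both read lengths.

    Assumes fragment_size is measured from the outer ends of the pair.
    """
    return round(fragment_size - 2 * read_length)


# Values reported in the question
mean_read_length = 37.6992   # Summary Statistics on the reads
measured_insert = 149.53     # CollectInsertSizeMetrics mean
expected_fragment = 300      # fragment size used for the TopHat run

print(suggested_segment_length(mean_read_length))               # 18
print(mate_inner_distance(expected_fragment, 38))               # 224
print(mate_inner_distance(measured_insert, mean_read_length))   # 74
```

Under these assumptions, the inner distance given to TopHat (224) is roughly three times what the measured insert size implies (about 74), so it may be worth rerunning the mapping with a value closer to the measured one and comparing the number of concordant pairs.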

Thanks, Jen, Galaxy team


Thank you Jen, I shall try the troubleshooting tips you have suggested!

written 2.1 years ago by frankie.north
