Identification Of Replicate Outlier

Question: Identification Of Replicate Outlier

6.1 years ago by

Dave Corney • 50 wrote:

Hello list,

I've been analyzing an experiment with two groups, each with three replicates. My workflow was TopHat (paired-end) -> Cufflinks -> CuffDiff. Unfortunately, CuffDiff identified few significant differences, and I am wondering whether one of my replicates might be an outlier. Does anybody have a suggestion on how to search for an outlier? The quality statistics of the unprocessed data looked equally good for all samples, so I don't think that this is a problem.

Thanks,
Dave

rna-seq cufflinks • 2.1k views


modified 6.1 years ago by fubar ♦ 1.1k • written 6.1 years ago by Dave Corney • 50

6.1 years ago by

fubar ♦ 1.1k

Australia

fubar ♦ 1.1k wrote:

Hi Dave,

This is an interesting and non-trivial question that extends well beyond Galaxy, and there's no simple solution AFAIK. Defining an 'outlier' tends to boil down to subjective judgement in most real cases I've seen. E.g. see http://comments.gmane.org/gmane.science.biology.informatics.conductor/40927

My 2c worth:

a) Confirm with the FastQC tool that all of your sample library sizes and quality score distributions are comparable. A sample with a relatively low library size may indicate an upstream technical failure with (e.g.) RNA extraction or a flowcell lane.

b) Check that the number of unique alignments to the reference is similar across samples (e.g. Picard alignment summary metrics or even the samtools flagstat tool).

c) If you can create an appropriate input matrix (e.g. read counts by exon or other contig for each sample), the Principal Component Analysis tool might be helpful. Library size normalization is one devil that lies in the detail, and PCA is not quite the same as MDS (see below).

d) If you're an R hacker, you might find http://gettinggeneticsdone.blogspot.com.au/2012/09/deseq-vs-edger-comparison.html useful. It shows how to get MDS plots, which are probably the most reliable way to identify samples that don't cluster well with the other members of their tribe.

--
Ross Lazarus MBBS MPH; Head, Medical Bioinformatics, BakerIDI; Tel: +61 385321444
http://scholar.google.com/citations?hl=en&user=UCUuEM4AAAAJ
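As a rough illustration of points (c) and (d), the sketch below normalizes a toy count matrix for library size (log2 counts per million) and then reports, for each sample, its mean Euclidean distance to the other replicates in its own group. This is not MDS or PCA itself; it only computes the kind of pairwise sample distances that such plots visualise. The sample names, counts, the 0.5 prior and the helper functions are all made up for the example.

```python
import math

def log_cpm(counts, prior=0.5):
    """Library-size normalization: raw counts -> log2 counts per million.
    The prior (an arbitrary choice here) avoids log2(0)."""
    out = {}
    for sample, values in counts.items():
        lib_size = sum(values)
        out[sample] = [math.log2((v + prior) / lib_size * 1e6) for v in values]
    return out

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance_to_group(expr, groups):
    """For each sample, the mean distance to the other replicates of its group."""
    result = {}
    for sample, group in groups.items():
        mates = [s for s, g in groups.items() if g == group and s != sample]
        if not mates:  # a group with a single sample has nothing to compare to
            continue
        total = sum(distance(expr[sample], expr[m]) for m in mates)
        result[sample] = total / len(mates)
    return result

# Hypothetical counts for five features; ctrl_2 was sequenced twice as deeply
# and ctrl_3 is deliberately discordant.
counts = {
    "ctrl_1": [100, 200, 50, 400, 250],
    "ctrl_2": [220, 380, 110, 840, 450],
    "ctrl_3": [400, 60, 300, 90, 150],
    "trt_1": [300, 100, 80, 200, 320],
    "trt_2": [310, 95, 85, 210, 300],
    "trt_3": [290, 105, 78, 190, 337],
}
groups = {name: name.split("_")[0] for name in counts}

library_sizes = {name: sum(values) for name, values in counts.items()}
scores = mean_distance_to_group(log_cpm(counts), groups)
suspect = max(scores, key=scores.get)

for name in sorted(scores, key=scores.get, reverse=True):
    print(name, library_sizes[name], round(scores[name], 2))
```

A sample whose score is much larger than those of its group mates is a candidate for closer inspection, not an automatic exclusion; with only three replicates per group the call remains a judgement.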

written 6.1 years ago by fubar ♦ 1.1k

Hi Ross, Thanks for the suggestions. I'm aware that this is not really a Galaxy-specific question, and I've been browsing through SeqAnswers and found a couple of suggestions using edgeR or DESeq, but nothing for Tuxedo suite. However, I have no experience with either of these tools, so I was wondering how others have approached this problem if their workflow is based on Cufflinks. In the meantime, I'll go through your suggestions and see where I get. Thanks, Dave

written 6.1 years ago by Dave Corney • 50

I like starting with this approach because it can be done easily in Galaxy. You can take the expression datasets produced by Cufflinks for each replicate and join them on gene name to get one big table of per-replicate expression values, then either eyeball it or use PCA. Note that since Cufflinks produces FPKM, library size is already accounted for.

Another idea: Cuffdiff already has an advanced model for dealing with replicates: http://cufflinks.cbcb.umd.edu/howitworks.html#reps

You may want to investigate how this model works and whether you can tune it with parameter settings before giving up on using all your replicates. One challenge with this approach is that the Galaxy Cuffdiff wrapper does not yet include all parameters, so you might try enhancing the wrapper with additional, relevant parameters and using those as well as the existing ones. If you do this, please consider submitting your enhancements back to me so I can integrate them into our code base.

Best,
J.
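Outside Galaxy, the join-and-compare step could look roughly like the sketch below. It assumes each replicate has a tab-delimited Cufflinks tracking table with `tracking_id` and `FPKM` columns (check the headers of your own files); the toy tables, replicate names and helper functions are invented for the example. It performs an inner join on gene ID and then computes pairwise Pearson correlations of log2(FPKM + 1) as a simple stand-in for eyeballing the table or running PCA.

```python
import csv
import io
import math
from itertools import combinations

def read_fpkm(handle, id_col="tracking_id", value_col="FPKM"):
    """Read one tab-delimited tracking table into {gene_id: FPKM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[value_col]) for row in reader}

def join_on_gene(tables):
    """Inner join: keep only genes present in every replicate."""
    shared = set.intersection(*(set(t) for t in tables.values()))
    genes = sorted(shared)
    return genes, {name: [t[g] for g in genes] for name, t in tables.items()}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy stand-ins for per-replicate files; with real data, pass open(path)
# instead of io.StringIO(...). rep3 is deliberately discordant.
raw = {
    "rep1": "tracking_id\tFPKM\ngeneA\t10\ngeneB\t100\ngeneC\t1000\ngeneD\t5\ngeneE\t7\n",
    "rep2": "tracking_id\tFPKM\ngeneA\t12\ngeneB\t95\ngeneC\t1100\ngeneD\t6\n",
    "rep3": "tracking_id\tFPKM\ngeneA\t900\ngeneB\t4\ngeneC\t20\ngeneD\t300\n",
}
tables = {name: read_fpkm(io.StringIO(text)) for name, text in raw.items()}
genes, joined = join_on_gene(tables)

# Log-transform so a few highly expressed genes do not dominate the correlation.
logged = {name: [math.log2(v + 1.0) for v in values] for name, values in joined.items()}
corr = {(a, b): pearson(logged[a], logged[b]) for a, b in combinations(sorted(logged), 2)}

for pair, r in sorted(corr.items()):
    print(pair, round(r, 3))
```

A replicate that correlates poorly with both of its group mates while they agree with each other is the one to look at first.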

written 6.1 years ago by Jeremy Goecks • 2.2k
