To complement out cuffdiff analysis we thought to take a parallel route using Homer.
to do so we basically follow this protocol:
makeTagDirectory pairend.sam -format sam -fragLength 250 -removeSpikes 10000 5 -genome /data//projects/mm9/ref.fasta -checkGC -tbp 1 -sspe -flip
Then, we run analyzeRepeats.pl to make read summaries across genes
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 > Scramble_vs_Target_analyzeRepeats.txt
We are running this step as recommended by the Homer tutorial without normalization so the following step could use the unnormalized data and perform edgeR analysis on it.
Next step we run:
getDiffExpression.pl Scramble_vs_Target_analyzeRepeats.txt MIa9_1 MIa9_2 -repeats -edgeR > Scramble_vs_Target_analyzeRepeats_edgeR.txt
After testing this command and checking the perl script for getDiffExpression I can tell this:
This last step actually takes unnormalized data and transmits it to edgeR for analysis that aims to identify genes that significantly differ by expression. edgeR uses the original unnormalized data and creates a model for it from which it would derive finally log2FC, p-value, FDR.
Now, the output of getDiffExpression is a little unusual - it composed of two parts:
The three very right-side columns which are the output from edgeR (Log2FC; p-value, FDR). These are calculated based on pseudeo values in some cases and a model created by edgeR,
BUT - the rest of the table (all the left side) is a matrix that presents now newly normalized data.
However, unlike what some users I conversed with thought, this is NOT a normalized value that was generated by edgeR, but rather, this normalized data was generated by the Homer/getDiffExpression.pl script.
This data is only presented in the table for the user to see, but it was NOT actually used as part of the edgeR analysis (unlike what some users may think).
My issue is that we are currently not sure in what way this normalized data was normalized....
see this link for relevant indications
look for explanatory indication such as the following which exactly states what I've described above:
"This command will basically add 3 columns to the end of your original file per pairwise comparison. Most important are the log2 fold change and the p-value and FDR values. The raw read counts will be normalized in this file based on the total reads in each column so that the values are more easily to compare when looking through the results."
However, my attempts to re-run the previous step again using this time the normalization options in order to obtain the normalized values that appear in the output from getDiffExpression did not prove to be successful...
I tried -
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -normMatrix 1e6
analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -norm 1e7
but the normalized values I obtained were quite different from the ones that were obtained after running the getDiffExpression.pl script...
We think that it would become important to know what normalization exactly was performed on the data presented in this output along with the edgeR values.
Will be glad to hear other users ideas on the subject!
Thanks a lot,