Hi all. A few days ago I read the paper "Histone modification levels are predictive for gene expression" (PNAS (2010), 107, 2926-2931). It proposed a linear model predicting gene expression levels from combinations of different histone modifications, such as H3K4me3, H3K27me3, etc. The predictor variables were of the form log(Nj + aj), where Nj is the number of tags of modification j in each promoter region (4001 bp surrounding the TSS), and aj is a pseudocount that keeps the logarithm defined when Nj is zero. The authors made no mention of normalizing Nj. But then I read another paper, "Computational inference of mRNA stability from histone modification and transcriptome profiles" (Nucleic Acids Res (2012), 40(14):6414-23), which also uses a linear model, and there the authors used the normalized read coverage of each histone modification as the predictor variable. They state that "the read coverage of each histone modification in the 15 regions (read count per bp) was calculated and normalized according to the sequencing library size". So the first paper does not normalize while the second one does. I'm wondering why there is such a difference, and under what conditions normalization should be performed.
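To make the difference between the two constructions concrete, here is a minimal numpy sketch (the counts, library sizes, and pseudocount are made-up illustrative values, not from either paper). It contrasts the PNAS-style predictor log(Nj + aj) on raw tag counts with a library-size-normalized version in the spirit of the NAR paper:

```python
import numpy as np

# Hypothetical tag counts for one histone modification across 5 promoters,
# measured in two samples with the same biology but different sequencing depth.
counts_a = np.array([0, 10, 50, 200, 1000], dtype=float)  # library ~1e6 reads
counts_b = counts_a * 3                                   # same signal, 3x deeper

pseudocount = 1.0  # the a_j in log(N_j + a_j); keeps the log defined at N_j = 0

# PNAS-style predictor: log of raw counts plus pseudocount, no normalization
x_raw_a = np.log(counts_a + pseudocount)
x_raw_b = np.log(counts_b + pseudocount)

# Library-size-normalized predictor: scale to reads per million first
lib_a, lib_b = 1e6, 3e6
x_norm_a = np.log(counts_a / lib_a * 1e6 + pseudocount)
x_norm_b = np.log(counts_b / lib_b * 1e6 + pseudocount)

# Without normalization, the deeper library shifts the predictor by ~log(3),
# so the same biology yields different predictor values across samples.
print(np.allclose(x_norm_a, x_norm_b))  # True
print(np.allclose(x_raw_a, x_raw_b))    # False
```

The point of the sketch: raw counts confound biology with sequencing depth, which matters whenever samples (or modifications) with different library sizes are compared within one model.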
I hope someone can help. Thank you so much!
My 2c: I mostly rely on edgeR, which takes raw counts over regions of interest and uses normalisation factors as an offset in the model rather than adjusting the counts directly. Other popular methods use the library size to adjust the fragment counts themselves, so the 'right' answer depends on the specific model and biological question.
If you want advice from statisticians working on these interesting and important issues, they are more likely to be found on (e.g.) the Bioconductor mailing list or perhaps SEQanswers than on this Galaxy forum - so if you don't get a good answer here, perhaps try there.