While looking closely at my alignment data I found something interesting. Some of my reads are aligned to the Y chromosome, although the sample is from an ovarian cancer cell line, in short a female donor.
Indeed, all of these reads are aligned to repeated regions, and for each gene on the Y chromosome that has any reads aligned I can find a paralogue on the X chromosome.
Although these reads are not numerous, I still fear that they may distort the calculation of gene / transcript expression levels, since the genes on autosomes are not affected by the duplication.
I wonder if there is any way to turn off the Y chromosome when using tophat2 (I'm aware of the simple method of removing the Y chromosome temporarily) or to merge the read counts before doing downstream analysis.
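For the second option (merging read counts before downstream analysis), a minimal Python sketch could look like the following. It is only an illustration, not a tophat2 feature: the function name `merge_paralogue_counts`, the Y-to-X mapping, and the count values are all assumptions made up for the example, and the real mapping would have to come from your own list of affected paralogue pairs.

```python
from collections import defaultdict


def merge_paralogue_counts(counts, y_to_x):
    """Fold the counts of Y-linked genes into their X-linked paralogues.

    counts: dict of gene name -> read count
    y_to_x: dict of Y-linked gene name -> X-linked paralogue name
            (hypothetical mapping, supplied by the user)
    """
    merged = defaultdict(int)
    for gene, n in counts.items():
        # Genes without an entry in the mapping keep their own name.
        merged[y_to_x.get(gene, gene)] += n
    return dict(merged)


# Toy values for illustration only, not real data.
counts = {"DDX3X": 120, "DDX3Y": 4, "ZFX": 30, "ZFY": 1, "GAPDH": 900}
y_to_x = {"DDX3Y": "DDX3X", "ZFY": "ZFX"}

merged = merge_paralogue_counts(counts, y_to_x)
```

The total number of reads is preserved, so any normalisation based on library size is unaffected. Whether adding the Y-assigned reads to the X paralogue is appropriate depends on how the reads were distributed between the two copies during alignment.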
This helps. If you are working on the command line, another option is to obtain our version of the hg19female variant, along with assorted useful indexes (including Tophat2 ... the <dbkey>.*.bt2 files). All of these are available on our rsync server in the hg19 top-level directory. Link with instructions: http://wiki.galaxyproject.org/Admin/UseGalaxyRsync
Should you decide to try this, the .loc files in the /location directory are formatted so that results are redirected back to the full hg19 assembly. This is very useful for visualization at UCSC, for use with other tools and reference files (later in the Cuff* tools), et cetera.
Good luck with your project, Jen, Galaxy team
Thanks for your answer. I'm actually working on my local server via the command line.
Judging by the presence of *random.fa files in the genome I think I'm using the hg19 full version. I think I'll just remove the Y chromosome and other unwanted .fa files next time.
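Removing the unwanted sequences can be sketched in Python as a simple FASTA filter. This is a rough illustration under stated assumptions: it assumes UCSC-style sequence names (`chrY`, names ending in `_random`), and the function names `keep_sequence` and `filter_fasta` are made up for the example.

```python
def keep_sequence(name):
    """Hypothetical rule: drop chrY and the unplaced *_random sequences."""
    return name != "chrY" and not name.endswith("_random")


def filter_fasta(lines, keep=keep_sequence):
    """Yield only the lines of FASTA records whose name passes `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first token after '>'.
            fields = line[1:].split()
            keeping = bool(fields) and keep(fields[0])
        if keeping:
            yield line


# Toy input for illustration only.
toy = [
    ">chr1\n", "ACGT\n",
    ">chrY\n", "TTTT\n",
    ">chr1_gl000191_random\n", "GGGG\n",
    ">chrX\n", "CCCC\n",
]

filtered = list(filter_fasta(toy))
```

On a real genome the same generator can be fed an open file handle and its output written to a new FASTA file. The Bowtie2 index used by tophat2 would then need to be rebuilt from the filtered file.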
Thank you very much for the link and your effort!
Please accept the answer to help others find it. Thanks.