Finding Percent Totals in RNA-Seq

Does anyone know how to sum and divide RNA-Seq datasets? I have two datasets of FPKM values, which are two fractions of a whole sample, and I want to find each gene's percent of the total in each dataset. So essentially, for every gene, I need to divide its value in dataset 1 by the sum of its values in datasets 1 and 2. I can't seem to find a straightforward way to do this, preferably in Galaxy or R because I am new to this stuff. It seems like simple math and I can do it for individual genes, but I want to plot all genes in one figure to see trends. Thanks!

rna-seq statistics cuffdiff data-manipulation • 502 views

modified 18 months ago by Jennifer Hillman Jackson ♦ 25k • written 19 months ago by annie.e.collier • 20

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

If a reference annotation dataset that includes the attribute gene_name was used as an input to Cuffdiff, then that common gene identifier can be used to link the two datasets' content and perform summary calculations.

- Group or Datamash can be used to count and add up the number of occurrences of each gene identifier.
- Line/Word/Character count can be used on that output to count the total number of unique genes.
  - The number of lines equals the number of genes.
- Join two files can be used to merge the datasets by a common gene identifier.
- Compute can be used to add, divide, and multiply values in tabular data per line (plus other functions).
  - Example syntax for add: c5+c6 means column 5 plus column 6. The result can be rounded or not, depending on the desired output value.
  - Example syntax for divide: c5/c6 means column 5 divided by column 6. Do not round, so that the result is kept as a fraction.
  - Example syntax for multiply: cX*100, where cX is the column holding the fraction; the result is a percentage value. This can be rounded or not, although rounding will make graphing easier.
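For anyone who wants to check the arithmetic outside of Galaxy, here is a minimal Python sketch of the same sequence: sum per gene (the Group/Datamash step), join on the gene identifier, then compute the fraction times 100. The gene names and FPKM values are made up for illustration, the function names are hypothetical, and keeping only genes found in both datasets is an assumption about how you would want the join to behave.

```python
from collections import defaultdict


def sum_by_gene(rows):
    """Add up FPKM values per gene identifier (like Group or Datamash)."""
    totals = defaultdict(float)
    for gene, fpkm in rows:
        totals[gene] += float(fpkm)
    return dict(totals)


def percent_in_first(fraction1, fraction2):
    """Join on gene identifier, then compute 100 * f1 / (f1 + f2) per gene."""
    result = {}
    # Assumption: keep only genes present in both datasets.
    for gene in sorted(fraction1.keys() & fraction2.keys()):
        f1, f2 = fraction1[gene], fraction2[gene]
        total = f1 + f2
        # A gene with zero signal in both fractions has no defined percentage.
        result[gene] = None if total == 0 else 100.0 * f1 / total
    return result


# Hypothetical (gene, FPKM) rows, for illustration only.
dataset1 = [("geneA", 30.0), ("geneB", 5.0), ("geneB", 5.0), ("geneC", 0.0)]
dataset2 = [("geneA", 10.0), ("geneB", 30.0), ("geneC", 0.0)]

percents = percent_in_first(sum_by_gene(dataset1), sum_by_gene(dataset2))
for gene, pct in percents.items():
    print(gene, "NA" if pct is None else round(pct, 1))
```

Note the zero check: genes with an FPKM of 0 in both fractions would otherwise cause a division by zero, so decide up front whether to drop them or report them as missing before plotting.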

Each step is performed by an individual tool, and many of these tools are similar to command-line functions. Once you work out a protocol, extract a workflow from the history, edit the workflow to keep just these operations, and reuse it as what will become, in essence, a single-click "custom tool".

Related manual help for Cuffdiff inputs, output file formats, and where to obtain reference annotation files that contain all the attribute values used by tools in this suite (p_id, tss_id, gene_name):

http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html#cuffdiff-input-files
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html#differential-expression-tests
http://cole-trapnell-lab.github.io/cufflinks/getting_started/#using-pre-built-annotation-packages (for common reference genomes) or https://support.illumina.com/sequencing/sequencing_software/igenome.html (all available)
- Note: Download the tar file locally, uncompress it, then upload just the genes.gtf file to Galaxy
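Before uploading, it can be worth confirming that the genes.gtf file really carries these attributes. Below is a rough Python sketch that parses the attribute column (column 9) of a single GTF line and reports which of gene_name, p_id, and tss_id are missing. The example line is invented, the function names are hypothetical, and the pattern assumes every attribute is written as key "value"; with a trailing semicolon, which not every GTF file follows.

```python
import re

# Matches attributes written as: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(line):
    """Return the attribute column (column 9) of one GTF line as a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return {}
    return dict(ATTRIBUTE_PATTERN.findall(fields[8]))


def missing_attributes(line, required=("gene_name", "p_id", "tss_id")):
    """List the required attribute names that this GTF line does not carry."""
    attrs = parse_gtf_attributes(line)
    return [name for name in required if name not in attrs]


# Invented example line, for illustration only.
example = (
    "chr1\tsource\texon\t100\t200\t.\t+\t.\t"
    'gene_id "G1"; transcript_id "T1"; gene_name "ABC1"; tss_id "TSS1";'
)
print(missing_attributes(example))
```

Keep in mind that p_id is normally present only on lines belonging to coding transcripts, so a missing p_id on some lines is not necessarily a problem.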

Hopefully this helps! Jen, Galaxy team

modified 18 months ago • written 18 months ago by Jennifer Hillman Jackson ♦ 25k