Hi, I am trying to run RNA-Seq on a local instance of Galaxy. I have almost 500 GB of decompressed raw data. Is there a way to find out how much hard disk space I would need to run Tophat and Cufflinks on my local instance? I don't want to start the analyses and have them stop due to lack of space. Thanks
Hello,
It is difficult to extrapolate the output size from a given input size. There are many factors: parameters, target genome, duplication in the input FASTQ datasets, etc. The output will be about the same size as if the tool were run on the command line.
That said, if you have multiple dataset pairs to run that you believe have about the same content, and each will use the same run-time settings during mapping, perhaps execute Tophat with just one pair first and then use that result to estimate how much disk each pair will consume.
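The pilot-run idea can be turned into a quick calculation. Below is a minimal Python sketch, not anything built into Galaxy or Tophat: the function names, the example numbers, and the 1.5x safety factor are all assumptions chosen for illustration. It sums the size of one pair's output directory and scales that by the ratio of total input to pilot input. Intermediate files written during a run may push peak usage above the final output size, which is why some margin is included.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def estimate_total_output(pilot_input_bytes, pilot_output_bytes,
                          total_input_bytes, safety_factor=1.5):
    """Scale the pilot's output/input ratio up to the full input size.

    safety_factor is a hypothetical margin, not a measured value.
    """
    ratio = pilot_output_bytes / pilot_input_bytes
    return ratio * total_input_bytes * safety_factor


if __name__ == "__main__":
    GB = 10**9
    # Hypothetical numbers: one 25 GB pair produced 12 GB of output,
    # and the full dataset is 500 GB.
    estimate = estimate_total_output(25 * GB, 12 * GB, 500 * GB)
    print(f"Estimated output: {estimate / GB:.0f} GB")
```

In practice, `pilot_output_bytes` would come from calling `dir_size_bytes` on the output directory of the single pilot run. The estimate only holds if the remaining pairs really do resemble the pilot pair.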
You could also try asking (or searching prior Q&A) at the user group for Tophat, but I expect that you will get a similar reply. The contact info is in the right side bar here: https://ccb.jhu.edu/software/tophat/faq.shtml
Thanks, Jen, Galaxy team
In my experience, BAM files are usually smaller than the gzipped FASTQ files they were produced from. Example: 2.1 TB of (gzipped!) FASTQ produced 1.3 TB of BAM files using RNA-STAR. Of course, as Jennifer says, many factors play a role here, so don't treat this as a firm rule. However, if you have 500 GB of raw FASTQ, I can't imagine you will need more than another 500 GB for the alignments. Compared to the BAM files, the Cufflinks output files are pretty small.
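To make the numbers above concrete, and to check free space before starting, here is a small Python sketch. It is only an illustration: `has_enough_space` is a hypothetical helper name, the 500 GB figure is the rough upper bound suggested above rather than a guarantee, and `"."` stands in for wherever your Galaxy instance stores its datasets. Note that the ratio comes from gzipped input, so relative to decompressed FASTQ the BAM output would be smaller still.

```python
import shutil

GB = 10**9  # decimal gigabytes; an assumption about how the sizes were quoted

# Ratio from the example above: 1.3 TB of BAM from 2.1 TB of gzipped FASTQ.
bam_to_fastq_ratio = 1.3 / 2.1


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    required = 500 * GB  # rough upper bound for the alignments, per the reply above
    print(f"BAM/FASTQ ratio in the example: {bam_to_fastq_ratio:.2f}")
    # Replace "." with the directory where your Galaxy datasets are written.
    print("Enough free space:", has_enough_space(".", required))
```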