Why won't Galaxy recognise uploaded hg19 .fasta.gz file?

15 months ago by

a.g.douglas • 10 wrote:

Hi. I'm doing RNA-seq analysis for the first time and want to upload the hg19 genome from 1000 Genomes (human_g1k_v37.fasta.gz) to Galaxy. I then plan to use the concatenate tool to append the FASTA file for my ERCC spike-in controls (ERCC92.fa) to the genome file before aligning my RNA-seq data. For some reason Galaxy doesn't recognise the .fasta.gz genome file I've uploaded and so won't decompress it. Any ideas how to solve this?

Also, am I meant to upload a .gtf file for the genome as well? I found GENCODE's file gencode.v19.annotation.gtf.gz and am hoping this will work OK. I was planning to concatenate it with the ERCC92.gtf file I have.
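For reference, the concatenation step described above can also be done outside Galaxy before uploading. The following is a minimal sketch, not the Galaxy concatenate tool: it assumes the inputs are plain or gzip-compressed text files, and the function names and the output filename in the usage note are hypothetical.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it starts with the gzip magic bytes."""
    path = Path(path)
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def concatenate(inputs, output):
    """Write the decompressed contents of each input, in order, to one uncompressed output file."""
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open_maybe_gzip(path) as src:
                # Copy in 1 MiB chunks so a whole genome never has to fit in memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's first record starts on its own line.
            if last != b"\n":
                out.write(b"\n")
```

A call such as `concatenate(["human_g1k_v37.fasta.gz", "ERCC92.fa"], "genome_plus_ERCC.fa")` would write the genome followed by the spike-in sequences. The same function works for the two GTF files, since it only copies lines.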

Thanks for any advice!

rna-seq genome galaxy ercc • 470 views

modified 15 months ago • written 15 months ago by a.g.douglas • 10

15 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Hello,

Did you try directly assigning the datatype fasta.gz? If you try that and it fails, how does it fail in the UI? Are there any error messages?

I am not sure why autodetect didn't assign the datatype upon upload, but I will test with sample data in case there is something to be fixed or improved. The size of the data could also be affecting autodetect. If anything comes from that, I'll write back.

Please also be aware that using the hg19 genome as a custom genome is likely to trigger jobs that exceed the memory resources available at Galaxy Main (https://usegalaxy.org). Indexing a larger genome on the fly for every job is compute-intensive. Your analysis would be better done in a local, Docker, or cloud Galaxy with sufficient resources, where you can index the genome directly instead of using it as a custom reference genome. It may be, though, that you are already working on your own server and this was a way to prepare a novel indexed genome.

As for whether the GTF dataset matches the genome: that depends on where you sourced the 1000 Genomes data. Compare the chromosome identifiers. They need to be exactly the same in both datasets for the two to be used in the same analysis.
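The identifier comparison can be scripted locally. The sketch below is one possible approach and is not part of Galaxy: it assumes the FASTA identifier is the first word after ">" and that the GTF sequence name is the first tab-separated column, and all function names are hypothetical. As far as I know, 1000 Genomes b37 FASTA files name chromosomes "1", "2", "MT" while GENCODE GTF files use "chr1", "chr2", "chrM", so it is worth checking rather than assuming.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def fasta_ids(path):
    """Collect sequence identifiers: the first word after '>' on each header line."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def gtf_ids(path):
    """Collect the values in column 1 (sequence name) of a GTF file, skipping comment lines."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                ids.add(line.rstrip("\n").split("\t", 1)[0])
    return ids


def compare_ids(fasta_path, gtf_path):
    """Return (shared, only_in_fasta, only_in_gtf) as sorted lists."""
    fa, gtf = fasta_ids(fasta_path), gtf_ids(gtf_path)
    return sorted(fa & gtf), sorted(fa - gtf), sorted(gtf - fa)
```

If the `shared` list comes back empty while both of the other lists are populated, the two files use different naming schemes and one of them needs its identifiers converted before they are used together.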

FAQs for data: https://galaxyproject.org/support/#getting-inputs-right-

Jen, Galaxy team

15 months ago by

a.g.douglas • 10 wrote:

Thanks Jen! Unfortunately I couldn't see any fasta.gz option in the Datatype tab. I tried several times to upload the hg19 fasta.gz file in different ways through the Get Data tool. A few times I got red error messages saying "An error occurred with this dataset:". The one time it did seem to work and the job turned green, it said "Problem decompressing gzipped data" and told me the file was 0 bytes!
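One thing that can be ruled out locally is a truncated or corrupted download, which would also produce a decompression error. Nothing in this thread confirms that as the cause, and the sketch below does not reproduce Galaxy's own upload checks. It only tests whether the whole file decompresses on your machine; the function name is hypothetical.

```python
import gzip
import zlib


def check_gzip(path, chunk_size=1 << 20):
    """Return (ok, detail): ok is True if the whole file decompresses cleanly."""
    with open(path, "rb") as handle:
        if handle.read(2) != b"\x1f\x8b":
            return False, "missing gzip magic bytes; not a gzip file"
    total = 0
    try:
        with gzip.open(path, "rb") as src:
            # Read to the end so truncation or corruption anywhere is detected.
            while chunk := src.read(chunk_size):
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        return False, f"decompression failed: {exc}"
    return True, f"{total} decompressed bytes"
```

Running `check_gzip("human_g1k_v37.fasta.gz")` before uploading shows whether the local copy is intact. If it is, the problem is more likely on the upload or datatype-detection side.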

Sorry - yes, I have been trying to use the Galaxy Main site for this. I didn't realise that aligning to a custom hg19 would be too big a job for the main site. I will instead try aligning my data to the built-in hg19. But in that case, how can I use the ERCC spike-in control reads that I have in my RNA-seq data? I've read a few posts on this forum that suggest appending the ERCC fasta and gtf files to hg19 and then using this as a custom genome, which is why I was originally trying this approach. If I use the built-in hg19 to align my reads, can I later go back and do a separate alignment using the ERCC92.fa / ERCC92.gtf files to check the controls?
