Question: Why won't Galaxy recognise uploaded hg19 .fasta.gz file?
4 months ago, a.g.douglas wrote:

Hi. I'm doing RNA-seq analysis for the first time and want to upload the hg19 genome from 1000 Genomes (human_g1k_v37.fasta.gz) to Galaxy so that I can then use the concatenate tool to append the fasta file from my ERCC spike-in controls (ERCC92.fa) to the genome file prior to doing alignment of my RNA-seq data. For some reason Galaxy doesn't seem to recognise the .fasta.gz genome file I've uploaded and so won't decompress it. Any ideas how to solve this?

Also, am I meant to upload a .gtf file for the genome as well? I found GENCODE's file gencode.v19.annotation.gtf.gz and am hoping this will work OK. I was planning to concatenate it with the ERCC92.gtf file I have.

Thanks for any advice!

Tags: rna-seq, genome, galaxy, ercc

4 months ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

Did you try directly assigning the datatype fasta.gz? If you try that and it fails, how does it fail in the UI? Are there any error messages?

I am not sure why autodetect didn't assign the datatype upon upload, but I will test with sample data in case there is something to be fixed or improved. The size of the data could also be affecting autodetect. If anything comes from that, I'll write back.

Please also be aware that using the hg19 genome as a custom genome is likely to trigger jobs that exceed the memory resources for jobs at Galaxy Main (https://usegalaxy.org). It is compute-intensive to index larger genomes on the fly for each job. Your analysis would be better done in a local, Docker, or cloud Galaxy with sufficient resources, where you can index the genome directly instead of using it as a custom reference genome. But it may be that you are already working on your own server and this was a way to prep a novel indexed genome.

As for whether the gtf dataset matches the genome: that depends on where you sourced the 1000 Genomes data. Compare the chromosome identifiers. These need to be exactly the same between the two datasets to use them in the same analysis.
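
One way to compare the identifiers outside Galaxy is sketched below in Python. This is only an illustration, not a Galaxy tool: the function names (open_text, fasta_ids, gtf_ids, compare_ids) are made up for this example, and it assumes the FASTA names are the first word after ">" and the GTF names are in the first tab-separated column. As far as I know, 1000 Genomes b37 files name chromosome 1 as "1" while GENCODE annotations use "chr1", which is the kind of mismatch this would reveal.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_ids(path):
    """Collect sequence names: the first word after '>' on each header line."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def gtf_ids(path):
    """Collect sequence names from column 1, skipping comments and blank lines."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            ids.add(line.split("\t", 1)[0])
    return ids


def compare_ids(fasta_path, gtf_path):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    genome = fasta_ids(fasta_path)
    annotation = gtf_ids(gtf_path)
    return sorted(annotation - genome), sorted(genome - annotation)
```

Calling compare_ids("human_g1k_v37.fasta.gz", "gencode.v19.annotation.gtf.gz") on local copies returns two lists. Any name in the first list is annotated in the gtf but has no sequence in the genome, so those annotations could not be used with that reference as-is.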

FAQs for data: https://galaxyproject.org/support/#getting-inputs-right-

Jen, Galaxy team

4 months ago, a.g.douglas wrote:

Thanks Jen! Unfortunately I couldn't see any fasta.gz option in the Datatype tab. I tried a few times to upload the hg19 fasta.gz file in different ways through the Get Data tool. A few times I got red error messages saying "An error occurred with this dataset:" and the one time it did seem to work and the job turned green it said "Problem decompressing gzipped data" and told me the file was 0 bytes!
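
A "Problem decompressing gzipped data" message with a 0-byte result can also mean the local copy is incomplete or corrupt, so it may be worth checking the file before uploading again. Below is a minimal Python sketch of such a check. The function name check_gzip and its messages are hypothetical, and passing it only shows the file decompresses cleanly, not that Galaxy will accept it.

```python
import gzip
import os
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def check_gzip(path, chunk_size=1 << 20):
    """Return (ok, message) saying whether `path` decompresses cleanly."""
    if os.path.getsize(path) == 0:
        return False, "file is empty (0 bytes)"
    with open(path, "rb") as raw:
        if raw.read(2) != GZIP_MAGIC:
            return False, "file does not start with the gzip magic bytes"
    total = 0
    try:
        # Read the whole stream in chunks so a large genome never sits in memory.
        with gzip.open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, zlib.error) as exc:
        # A truncated download typically surfaces here as EOFError.
        return False, f"decompression failed: {exc}"
    return True, f"ok, {total} bytes after decompression"
```

For example, check_gzip("human_g1k_v37.fasta.gz") on the downloaded file returns a pair such as (False, "decompression failed: ...") if the download was cut short.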

Sorry - yes I have been trying to use the Galaxy Main site for this - I didn't realise using a custom hg19 to align to would be too big a job for the main site. I will instead try aligning my data to the built-in hg19. But in that case how can I use the ERCC spike-in control reads that I have in my RNA-seq data? I've read a few posts on this forum that suggest appending the ERCC fasta and gtf files to hg19 and then using this as a custom genome, which is why I was originally trying this approach. If I use the built-in hg19 to align my reads, can I later go back and do a separate alignment using the ERCC92.fa / ERCC92.gtf files to check the controls?
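
For reference, the appending step those posts describe could be done outside Galaxy with something like the Python sketch below. It is an illustration only: the function names open_binary and concatenate and the output filename are hypothetical, and it simply writes the decompressed inputs one after another. The same function would work for the gtf files, but the result is only usable if the chromosome names in the genome and the annotation match.

```python
import gzip


def open_binary(path):
    """Open a plain or gzip-compressed file for reading bytes."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def concatenate(inputs, output, chunk_size=1 << 20):
    """Write the decompressed contents of `inputs`, in order, to `output`."""
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open_binary(path) as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                # Make sure the next file's first record starts on its own line.
                out.write(b"\n")
```

For example, concatenate(["human_g1k_v37.fasta.gz", "ERCC92.fa"], "genome_plus_ercc.fa") would produce one uncompressed fasta holding the genome followed by the 92 ERCC sequences.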
