Hello,
I checked your history with the analysis. There is a reference genome mismatch problem between the natively indexed TAIR10 genome at Galaxy Main (http://usegalaxy.org) and the GTF reference annotation from iGenomes.
Good news!! You are very close to obtaining a successful run and avoiding issues with other tools. The genome.fa from the same annotation bundle from iGenomes is already in your history! To solve the current problem and later potential issues, do the following:
- Re-map the reads against the TAIR10 *genome.fa genome from the bundle*. This will ensure that all data going forward is based on the exact same reference genome with identical chromosome identifiers. This is very important to obtain valid analysis results - whether tools fail or not.
- Use the Custom Reference genome option with the mapping tool. General help with video guides and other quick tips: https://galaxyproject.org/learn/custom-genomes/
- DO NOT assign the database metadata attribute as the natively indexed TAIR10 genome.
- Instead, promote the Custom Genome to a Custom Build (detailed help in the same link above).
- Assign that Custom Genome Build as the metadata "Database" attribute to the BAM and all other datasets associated with this genome (generated by tools - if not done by default - plus upload datasets used). Again, this avoids issues and ensures tools use the correct genome build. It is worth the extra steps. No one likes to start over from mapping.
- Mapping tools will not use this database assignment, but many other common tools do, and this proper database assignment will avoid further confusing issues/poor results. The goal is to fully annotate datasets with the actual genome used.
Note that some sources of Custom Reference genomes (in particular those from NCBI, or those assembled yourself) have title lines with complicated/extended annotation - not just the simple chromosome identifiers. Before starting an analysis, clean up the title line so that only chromosome identifiers remain (the ">" line in a fasta dataset) and re-wrap the fasta file lines at 80 bases before creating a Custom genome/build. Use the tool NormalizeFasta (also explained in the link above).
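For anyone who wants to see what that cleanup amounts to outside of Galaxy, here is a minimal Python sketch. It is not the NormalizeFasta tool itself, and the function name `normalize_fasta` is made up for illustration. It assumes the chromosome identifier is the first whitespace-delimited word on the ">" line, which is not true for every source, so check the result against the identifiers in your annotation.

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with each title cut down to its first word
    and the sequence re-wrapped at `width` bases per line.

    Assumption: the identifier is the first whitespace-delimited
    word after ">". Adjust if your source formats titles differently.
    """
    seq_parts = []

    def flush():
        # Emit the buffered sequence in fixed-width chunks.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            seq_parts.append(line)
    yield from flush()


# Illustrative input only (not real TAIR10 data).
example = [">1 dna:chromosome extra annotation", "ACGT" * 30, "ACGT" * 5]
for out_line in normalize_fasta(example):
    print(out_line)
```

The whole sequence of a record is buffered in memory before re-wrapping, which is fine for a small plant genome but may need a streaming approach for very large chromosomes.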
Please try the above and let us know if you need more help. Cheers! Jen, Galaxy team
Do the chromosome names in your GTF and BAM file match? That's the most frequent cause of this.
Agree with Devon. Try running BAM-to-SAM on the aligned data outputting just the SAM header and compare the chromosome identifiers between all inputs. More: https://wiki.galaxyproject.org/Support#Reference_genomes
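If you download the SAM header and the GTF, the comparison can also be scripted. The following is a rough Python sketch, not a Galaxy tool: the function names are hypothetical, and the sample identifiers and lengths are placeholders rather than values taken from your data. It relies only on the SAM convention that `@SQ` header lines carry the sequence name in an `SN:` field, and on the GTF convention that the first tab-separated column is the sequence name.

```python
def sam_header_ids(header_lines):
    """Collect sequence names from the SN: field of @SQ header lines."""
    ids = set()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    ids.add(field[3:])
    return ids


def gtf_ids(gtf_lines):
    """Collect sequence names from the first column of a GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        ids.add(line.split("\t", 1)[0])
    return ids


def compare_ids(header_lines, gtf_lines):
    """Report which identifiers are shared and which appear on one side only."""
    bam, gtf = sam_header_ids(header_lines), gtf_ids(gtf_lines)
    return {"shared": bam & gtf, "bam_only": bam - gtf, "gtf_only": gtf - bam}


# Placeholder data for illustration; names and lengths are made up.
header = ["@HD\tVN:1.0\tSO:coordinate\n",
          "@SQ\tSN:Chr1\tLN:1000\n",
          "@SQ\tSN:Chr2\tLN:2000\n"]
gtf = ['1\tensembl\tUTR\t3631\t3759\t.\t+\t.\tgene_id "AT1G01010";\n']
print(compare_ids(header, gtf))
```

An empty "shared" set, as in this made-up case, is the signature of a mismatch: the tool cannot relate any annotated feature to any aligned read.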
Jen, Galaxy team
Hi,
Thanks for your help. I do not really know how to check this; it is the first time I have used Galaxy. I took the GTF file from Arabidopsis_thaliana_Ensembl_TAIR10.tar, and it has the format shown below:
Seqname Source Feature Start End Score Strand Frame Attributes
1 ensembl UTR 3631 3759 . + . gene_biotype "protein_coding"; gene_id "AT1G01010"; gene_name "NAC001"; gene_source "ensembl"; gene_version "1"; p_id "P20332"; transcript_biotype "protein_coding"; transcript_id "AT1G01010.1"; transcript_name "ANAC001"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS22525";
The BAM file is the output of the TopHat program in Galaxy.
Is that the problem?
Thanks!
Did you try the help in the wiki? This is the detailed "how-to-check" linked from the general help shared in my first comment: https://wiki.galaxyproject.org/Support/ChromIdentifiers
Learning how to do this will avoid many headaches in the future. Input mismatch problems are very common and can be avoided. Data with mismatches will never process correctly to produce valid results, whether the tool errors or not.
I am also going into your account to check. When done, I will write back in a reply with exactly how I checked your particular data.