Hi
I have paired end RNA-Seq reads that I aligned to my reference genome using HISAT2 (alignment mostly >90%) and obtained BAM files, as a result. When I tried to obtain count information for these BAM files using the annotated file of the same build of the reference genome, I find that the resulting file has count values of 0, throughout.
I used exon as the GFF feature, excluded chimeric fragments, only allowed fragments with both reads aligned and allowed exon-exon junctions. The rest were all default settings, including not counting multimapping reads. I could see that there was a junction counts file having values for counts, though.
This is my counts summary :
Assigned 0
Unassigned_Unmapped 6571202
Unassigned_MappingQuality 5083205
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_NoFeatures 64827954
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 0
Could someone help me with this, please?
Thanks!
It's likely that the GFF (try to find a GTF file, they tend to work better) didn't match your genome version. To check this, load both a BAM file and your GFF file in IGV and spot check a few genes too see if the lack of counts is reasonable.
Thanks for your reply Devon!
I uploaded a GTF file and even compared 2 runs : one with mismatched genome version and the other with the right genome version and both gave me the same results. So I'm not sure what next to do.
I viewed the GTF file in IGV and it does have chromosome start and end locations, features, etc. - characteristic of an annotation file. I am unable to view my BAM file on usegalaxy.org as it keeps crashing (due to the size I guess?).
Do you think I've made a mistake with my input criterion?
I could just view my BAM file and these are the first few lines :
The entire file looks this way, with changing LN. FLAG, RNAME , POS, MAPQ, CIGAR, MRNM, MPOS, ISIZE, SEQ, QUAL, OPT all seem to be empty? Is this possible when the overall alignment rate is more than 90%?
How big is the file?
Devon, on point -- the genome size is likely the problem. Contains many short contigs (52,711 total sequences in the criGri1 aka C_griseus_v1.0 build).
skumar53, you are using a custom genome/build, correct? Try filtering the genome's fasta to exclude the short contigs included and remap against that version. This will require that you promote the custom genome to a custom build and assign that build's database (dbkey) to any resulting BAMs and the GTF for these to work correctly with many downstream tools. I added this as another check in the primary reply (item 8).
I'll probably convert all of that to a new "summary" support FAQ, but this specific issue (large/fragmented genomes/transcriptomes/exomes) can impact several tools/functions and is covered in this FAQ as a troubleshooting tip: https://galaxyproject.org/support/tool-error/#special-cases-memory-or-walltime
Thank you Devon and Jen for your insightful and informative replies! It truly went beyond the issue at hand and educated me further!
About the issue, it can be likely that the genome was the issue : I had previously aligned to both the Chinese hamster 2013 genome and the Chinese Hamster Ovary 2017 genome build and both gave the same / similar errors while obtaining the counts.
Yesterday, after reading Jen's inputs, I basically redid my entire alignment using newly downloaded fasta files from ENSEMBL, and in the process ensuring that it was from the same genome build. I used the 2017 version. Post alignment , I used featurecounts for my paired-end data and after playing around with the input options, I could finally get it working and thus obtain my counts!
It is to be noted here that I did NOT re-upload my annotated file from 2017 and so I do agree that the genome build might have caused the issue in alignment.
Nevertheless, during my attempts to get featurecounts to work, I maybe noticed a bug? My annotated file is a GTF file locally but when I uploaded it, Galaxy auto-detects it as a gff file. I had used this file before but after reading Jen's inputs, I changed the format to GTF, as required by featurecounts. Unfortunately, this did NOT work! In my next attempt, I changed it back to GFF, kept all the parameters the same (including my BAM, FASTA files and count parameters) and it worked!
So I believe there might be one of the 2 issues here : Either Galaxy does not auto-detect the file correctly or featurecounts can take in GFF files? I'm not sure which one as I'm a newbie but just thought I would let you both know about this! If you would like me to try it out once so that an error report can be generated, I could do it!
Thanks again for your inputs and to help me solve this!
Swetha
Datatype autodetect doesn't always guess the right type. Did your GTF file contain headers? If so, that can cause a GTF file to be assigned to GFF. Some tools can work with a header in a GTF file and others cannot. Some require GTF to be assigned and others will accept just GFF (but with GTF formatting in the attribute field, and not GFF3 as the data structure is very different from GFF/GTF). And just to add to the complexity, some tools accept both GTF and GFF3 format, while others only GTF or GFF that is formatted (mostly) like a GTF.
All of this is because of the different GTF/GFF3 content/formatting that various data providers release or 3rd party wrapped tools produce. There isn't a good "one size fits all" to resolve this, but we do work to construct tools in a way that at least one of these will work. It sounds like you found the right combination for your data source/format and this tool.
I had a ticket about this.. but closed it out as I do not really expect it to be addressed fully, or not anytime soon. If interested or want to add a comment/upvote it, please do. Community input can elevate the development priority for usage issues. https://github.com/galaxyproject/usegalaxy-playbook/issues/45
Appreciate the feedback!