Featurecounts for paired end RNA-seq reads not generating counts

Question: Featurecounts for paired end RNA-seq reads not generating counts

6 months ago by

skumar53 • 30 wrote:

I have paired end RNA-Seq reads that I aligned to my reference genome using HISAT2 (alignment mostly >90%) and obtained BAM files, as a result. When I tried to obtain count information for these BAM files using the annotated file of the same build of the reference genome, I find that the resulting file has count values of 0, throughout.

I used exon as the GFF feature, excluded chimeric fragments, only allowed fragments with both reads aligned and allowed exon-exon junctions. The rest were all default settings, including not counting multimapping reads. I could see that there was a junction counts file having values for counts, though.

This is my counts summary :
Assigned 0
Unassigned_Unmapped 6571202
Unassigned_MappingQuality 5083205
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_NoFeatures 64827954
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 0

Could someone help me with this, please?

Thanks!

rna-seq featurecounts hisat2 • 708 views

ADD COMMENT • link •

modified 6 months ago by Jennifer Hillman Jackson ♦ 25k • written 6 months ago by skumar53 • 30

It's likely that the GFF (try to find a GTF file, they tend to work better) didn't match your genome version. To check this, load both a BAM file and your GFF file in IGV and spot check a few genes too see if the lack of counts is reasonable.

ADD REPLY • link written 6 months ago by Devon Ryan • 1.9k

Thanks for your reply Devon!

I uploaded a GTF file and even compared 2 runs : one with mismatched genome version and the other with the right genome version and both gave me the same results. So I'm not sure what next to do.

I viewed the GTF file in IGV and it does have chromosome start and end locations, features, etc. - characteristic of an annotation file. I am unable to view my BAM file on usegalaxy.org as it keeps crashing (due to the size I guess?).

Do you think I've made a mistake with my input criterion?

ADD REPLY • link written 6 months ago by skumar53 • 30

I could just view my BAM file and these are the first few lines :

@HD VN:1.0 SO:coordinate
@SQ SN:AMDS01163391 LN:201
@SQ SN:AMDS01163392 LN:201

The entire file looks this way, with changing LN. FLAG, RNAME , POS, MAPQ, CIGAR, MRNM, MPOS, ISIZE, SEQ, QUAL, OPT all seem to be empty? Is this possible when the overall alignment rate is more than 90%?

ADD REPLY • link written 6 months ago by skumar53 • 30

How big is the file?

ADD REPLY • link written 6 months ago by Devon Ryan • 1.9k

Devon, on point -- the genome size is likely the problem. Contains many short contigs (52,711 total sequences in the criGri1 aka C_griseus_v1.0 build).

skumar53, you are using a custom genome/build, correct? Try filtering the genome's fasta to exclude the short contigs included and remap against that version. This will require that you promote the custom genome to a custom build and assign that build's database (dbkey) to any resulting BAMs and the GTF for these to work correctly with many downstream tools. I added this as another check in the primary reply (item 8).

I'll probably convert all of that to a new "summary" support FAQ, but this specific issue (large/fragmented genomes/transcriptomes/exomes) can impact several tools/functions and is covered in this FAQ as a troubleshooting tip: https://galaxyproject.org/support/tool-error/#special-cases-memory-or-walltime

ADD REPLY • link modified 6 months ago • written 6 months ago by Jennifer Hillman Jackson ♦ 25k

Thank you Devon and Jen for your insightful and informative replies! It truly went beyond the issue at hand and educated me further!

About the issue, it can be likely that the genome was the issue : I had previously aligned to both the Chinese hamster 2013 genome and the Chinese Hamster Ovary 2017 genome build and both gave the same / similar errors while obtaining the counts.

Yesterday, after reading Jen's inputs, I basically redid my entire alignment using newly downloaded fasta files from ENSEMBL, and in the process ensuring that it was from the same genome build. I used the 2017 version. Post alignment , I used featurecounts for my paired-end data and after playing around with the input options, I could finally get it working and thus obtain my counts!

It is to be noted here that I did NOT re-upload my annotated file from 2017 and so I do agree that the genome build might have caused the issue in alignment.

Nevertheless, during my attempts to get featurecounts to work, I maybe noticed a bug? My annotated file is a GTF file locally but when I uploaded it, Galaxy auto-detects it as a gff file. I had used this file before but after reading Jen's inputs, I changed the format to GTF, as required by featurecounts. Unfortunately, this did NOT work! In my next attempt, I changed it back to GFF, kept all the parameters the same (including my BAM, FASTA files and count parameters) and it worked!

So I believe there might be one of the 2 issues here : Either Galaxy does not auto-detect the file correctly or featurecounts can take in GFF files? I'm not sure which one as I'm a newbie but just thought I would let you both know about this! If you would like me to try it out once so that an error report can be generated, I could do it!

Thanks again for your inputs and to help me solve this!

Swetha

ADD REPLY • link modified 6 months ago • written 6 months ago by skumar53 • 30

Datatype autodetect doesn't always guess the right type. Did your GTF file contain headers? If so, that can cause a GTF file to be assigned to GFF. Some tools can work with a header in a GTF file and others cannot. Some require GTF to be assigned and others will accept just GFF (but with GTF formatting in the attribute field, and not GFF3 as the data structure is very different from GFF/GTF). And just to add to the complexity, some tools accept both GTF and GFF3 format, while others only GTF or GFF that is formatted (mostly) like a GTF.

All of this is because of the different GTF/GFF3 content/formatting that various data providers release or 3rd party wrapped tools produce. There isn't a good "one size fits all" to resolve this, but we do work to construct tools in a way that at least one of these will work. It sounds like you found the right combination for your data source/format and this tool.

I had a ticket about this.. but closed it out as I do not really expect it to be addressed fully, or not anytime soon. If interested or want to add a comment/upvote it, please do. Community input can elevate the development priority for usage issues. https://github.com/galaxyproject/usegalaxy-playbook/issues/45

Appreciate the feedback!

ADD REPLY • link written 6 months ago by Jennifer Hillman Jackson ♦ 25k

6 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Common places to diagnose empty/unexpected Featurecounts results:

Review the featurecounts "summary" output. This can provide clues about why mapped reads were omitted from the counts in the "assigned" versus the "unassigned" categories. Prior Q&A at the tool's Google group (bit older post, but still valid): https://groups.google.com/forum/#!topic/subread/JPPw9lVfgpw
Confirm that the reference annotation GTF is both an actual GTF file or a GFF file with GTF-formatted attributes. GFF3 annotation is structured differently and cannot be used with this tool directly. The tool gffread can convert GFF3 to GTF. FAQ: Common datatypes explained https://galaxyproject.org/learn/datatypes/
Confirm the annotation dataset is based on the same exact genome build used with mapping the reads with HISAT2. It sounds like you have already done this. But, to help others that find this post and also need to troubleshoot, please see this support FAQ: Mismatched Chromosome identifiers (and how to avoid them) https://galaxyproject.org/support/chrom-identifiers/
Review the BAM's annotation lines. These are contained in the dataset after the header lines (lines that start with "@"). The attributes for type and identifier are important -- see the next item.
Review the Featurecount settings under Advanced Options. Defaults are: GFF feature type filter is "exon" and GFF gene identifier is "gene_id". These should be present in your reference annotation (item 4 above). The values can be adjusted if you want to count up a different type/identifier. Or, you might need to switch to using an annotation dataset that contains these attributes. Avoid using a GTF from the UCSC table browser -- Why? the gene_id and transcript_id are both the same value (transcript), meaning that counts will effectively be "by transcript" and not grouped "by gene".
Review the fastq data's strand assignment option set with both HISAT2 and Featurecounts. These should be matched, yet more importantly, actually match how the fastq library was created and sequenced. The strand orientation can vary between sequencing methods/protocols. https://galaxyproject.org/tutorials/ngs/#paired-end-data
Open the BAM and GTF in a browser and examine a few regions where counts are expected (a common gene's location, etc). If you don't find alignments at the same genomic coordinates as the annotation, this also can indicate a mismatch or data content/tool settings issue. Be aware that not all mapped reads will be counted and that is expected -- some will be excluded, "why" is what the "unassigned" lines in the summary report in item #1 above represent. Choices for browser visualization include Galaxy's built-in browser found under Visualize >> Create visualization.
If using a custom genome, it must first be promoted to a custom build, then any datasets that you want to link into the visualization need to have that custom build assigned as the "database". Featurecounts also requires this same database assignment for BAM/GTF inputs when using a custom genome. Make sure your genome/transcriptome/exome represents a reasonable number of sequences, as any over a few thousand will be problematic. FAQs: Preparing and using a Custom Reference Genome or Build https://galaxyproject.org/learn/custom-genomes/ && How do I find, adjust, and/or correct metadata? https://galaxyproject.org/support/metadata/ && My job ended with an error. What can I do? https://galaxyproject.org/support/tool-error/#special-cases-memory-or-walltime

Galaxy tutorials, including those for RNA-seq with example usage for HISAT2, Featurecounts (and up/downstream tools/methods): https://galaxyproject.org/learn/

Support FAQs: https://galaxyproject.org/support/

If you cannot determine the problem after checking the above/making changes (as needed) and can reproduce the problem at Galaxy Main https://usegalaxy.org (or really any Galaxy server with public access), send an email to galaxy-bugs@lists.galaxyproject.org that includes a shared history link. Make sure all inputs/outputs are undeleted, include the Featurecount dataset number that presents this problem, and add in the link to your post here so we can associate the two. How-to: https://galaxyproject.org/support/#unexpected-results >> Reporting Usage Issues or Software bugs https://galaxyproject.org/issues/

Thanks! Jen, Galaxy team

ps: This is a longer reply, but I wanted to put everything in one place so this can become a reference post for everyone encountering similar unexpected Featurecounts results.

Updated 5-23-2018

ADD COMMENT • link modified 6 months ago • written 6 months ago by Jennifer Hillman Jackson ♦ 25k

Similar posts • Search »