Unable to select GTF file from history in featureCounts (Galaxy version 1.6.0.3)

Hello,

I have recently begun mapping Drosophila RNA-Seq data with STAR (in Galaxy), and I am now attempting to count the number of reads aligning to features using featureCounts.

Though the gtf file (from Ensembl) I provided to STAR is still in my Galaxy history, I am unable to select it in the featureCounts tool. The file is still seen as a gtf option in the STAR tool.

I tried uploading a duplicate file, both zipped and already unzipped, and tried changing the file type to either gff or gff3, but I couldn't resolve the issue.

In case this information helps, Galaxy will auto-assign the gtf file in question as a gff if given the chance, but specifying the file as gtf either before or after upload hasn't resolved the issue.

Thanks!

rna-seq software error galaxy • 653 views

modified 6 months ago by stephythomas222 • 0 • written 6 months ago by j.christopher.rounds • 20

6 months ago by Jennifer Hillman Jackson ♦ 25k, United States, wrote:

Hello,

When using this tool, the input BAM must have a database (genome build) assignment, and the input reference annotation must have the same database assignment. If the genome was one already indexed on the server (a genome you used directly from Galaxy, not a custom genome), mapping tools normally assign it to the BAM result. However, STAR is not automatically assigning the database attribute as it should (we are working to fix this). Meanwhile, you could assign the database directly if you are not using a custom genome.

Note: featureCounts might have a small bug when using a custom genome. Is that your case as well? See this other post for details: https://biostar.usegalaxy.org/p/27973/. If it is the same issue, we can merge the two for updates. I am testing to see if a Custom Build is a potential workaround (one that could be used before the tool is updated).

And yes, the GTF datatype assignment with the Upload tool can be problematic with data from certain sources that include header/comment lines. That is another fix in progress, but at a lower priority. Meanwhile, just assign the datatype during Upload, or afterwards by editing the dataset's attributes. Many tools will work with the header intact, but some will not. Alternatively, you could remove the comment lines before Upload (GTF will then be autodetected correctly), or make the correction within Galaxy using the tool "Remove beginning of a file". Removing those comment lines at the top will avoid confusing errors/problems.

I don't think that featureCounts is impacted, but if the database assignments are the same between your BAM and GTF, both have the correct datatype assigned, and the tool still refuses to recognize the GTF, you could try the reformatting to see if it helps. I'll be including GTFs with/without headers in my featureCounts tests.
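For anyone who prefers to strip the comment lines locally before Upload, here is a minimal Python sketch. It assumes the header/comment lines are the ones starting with "#", which is how they usually appear in GTF files. The function name and the example record are made up for illustration and are not part of Galaxy or featureCounts.

```python
import io


def strip_gtf_header(src, dst):
    """Copy GTF lines from src to dst, dropping '#' comment lines and blank lines.

    src and dst are text file handles. Returns the number of lines dropped.
    """
    dropped = 0
    for line in src:
        if line.startswith("#") or not line.strip():
            dropped += 1
            continue
        dst.write(line)
    return dropped


# Tiny in-memory example (illustrative content, not real annotation data).
example = io.StringIO(
    "#!genome-build example\n"
    "#!genome-version example\n"
    '2L\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE1";\n'
)
cleaned = io.StringIO()
n_dropped = strip_gtf_header(example, cleaned)
```

With real files, pass handles from `open("in.gtf")` and `open("out.gtf", "w")` instead of the in-memory buffers. Unlike "Remove beginning of a file", which removes a fixed number of leading lines, this drops comment lines wherever they occur.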

UCSC is one choice for an alternative GTF that wouldn't have a header included. However, the problem with GTFs from that source is that the gene_id and transcript_id values are the same when extracted from the Table Browser: both values are the transcript name. This effectively means that all counts will be made by "transcript" and not grouped by "gene". This creates scientific content problems with the output from many tools (not just this one). iGenomes GTFs are well formatted, as are those from Gencode, and avoid all of these format/usage issues.
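One quick way to spot the gene_id/transcript_id problem described above is to measure how often the two values are identical in a GTF. The Python sketch below is only an illustration: the helper names are hypothetical, the attribute parsing is deliberately simple (quoted `key "value";` pairs in column 9), and the example records are invented.

```python
import re

# Matches GTF-style attributes such as: gene_id "ABC";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_attributes(column9):
    """Return the quoted key/value pairs from a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(column9))


def fraction_gene_equals_transcript(lines):
    """Fraction of records (having both IDs) where gene_id == transcript_id."""
    total = same = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_gtf_attributes(fields[8])
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene is None or tx is None:
            continue
        total += 1
        same += gene == tx
    return same / total if total else 0.0


# Invented records: the first mimics the identical-ID pattern, the second does not.
same_ids = ['chr2L\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "TX1"; transcript_id "TX1";\n']
distinct_ids = ['2L\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n']

frac_same = fraction_gene_equals_transcript(same_ids)
frac_distinct = fraction_gene_equals_transcript(distinct_ids)
```

A fraction at or near 1.0 suggests that counts would effectively be per transcript rather than per gene. Genuine single-transcript genes do not normally share an identifier with their transcript, so a high value is a reasonable warning sign.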

Thanks for reporting problems! Jen, Galaxy team

written 6 months ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen,

Excellent - assigning the same database to both my BAM output file from STAR and to my GTF annotation file solves the problem, allowing me to select the GTF file from the featureCounts tool! Thank you so much for the help.

That said, I am using a custom genome for alignment with STAR - specifically, I am using the dm6 Drosophila melanogaster genome assembly from Ensembl (with Ensembl conventions for naming chromosomes and contigs) to pair with the Ensembl GTF annotation file. Thanks to you, I just learned about and set up this reference genome as a Custom Build. Assigning this build as the database attribute for both my BAM and GTF files does allow me to choose them both and complete a (seemingly) successful run in featureCounts!

On this point, for my own understanding - am I correct that featureCounts doesn't use the reference genome for counting, only the alignment file and the annotation file? Could any database technically be assigned to both the BAM and GTF file in this case without affecting the results? I'm sure that could be an issue for other/downstream analyses, but I'm just curious how the tool works precisely.

Additionally, thank you for the explanation on why and how to reformat my GTF file to ensure full compatibility. Intriguingly, I did just try removing the comment lines locally and uploading the updated file, but it was still auto-detected as a GFF. Fortunately, using the GTF file with a header doesn't seem to have interfered with featureCounts, but I'll go ahead and use the headerless file in the future to avoid potential problems.

Finally, thank you for the explanation of the potential issues with using GTF files from UCSC. I had gotten the sense they were inappropriate to use for my purposes based on UCSC's file descriptions, but I had not seen why stated so clearly. If I run into any issues using the GTF file from Ensembl, I'll be sure to check out those from iGenomes and/or Gencode.

Thank you so much Jen - you've been an enormous help!

written 6 months ago by j.christopher.rounds • 20

Very glad this worked out!

I am not sure if you could just assign an arbitrary matching database to both inputs (one that is not what the data content is actually based on) and still get valid results with featureCounts. You could test it if you really wanted to. But using the correct custom genome build will allow you to use other tools that definitely make use of the reference genome itself through the database metadata assignment (this is not always obvious).

So, it is best to label data correctly to avoid problems, even if some tools don't specifically defend against all the possible ways to circumvent intended usage.

For some GTF sources, the datatype gff will be autodetected even without headers. This is due to differences in how the attributes in the file are formatted. When Galaxy guesses wrong, double-check the format yourself against the specification, then directly assign the datatype after Upload or, even better, directly assign it during Upload when you are certain the data already meets a specific format.
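To check the attribute formatting yourself, a rough Python sketch like the one below can help. It only looks at the ninth column and distinguishes GTF-style `key "value";` pairs from GFF3-style `key=value` pairs. This is an assumption-laden illustration, not Galaxy's actual datatype detection, and the function name is made up.

```python
import re

GTF_ATTR = re.compile(r'^\s*\w+ "[^"]*"\s*;')   # e.g. gene_id "G1";
GFF3_ATTR = re.compile(r'^\s*[^=;\s]+=[^;]*')   # e.g. ID=gene1


def guess_attribute_style(line):
    """Guess 'gtf', 'gff3', or 'unknown' from the attribute column of one record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return "unknown"
    column9 = fields[8]
    if GTF_ATTR.match(column9):
        return "gtf"
    if GFF3_ATTR.match(column9):
        return "gff3"
    return "unknown"


# Invented example records.
gtf_line = '2L\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
gff3_line = "2L\tsrc\texon\t1\t50\t.\t+\t.\tID=exon1;Parent=T1\n"
odd_line = "2L\tsrc\texon\t1\t50\t.\t+\t.\tgene_id G1;\n"

styles = [guess_attribute_style(x) for x in (gtf_line, gff3_line, odd_line)]
```

Records that come back as "unknown" (for example, unquoted values) are the kind worth comparing against the format specification by hand before assigning a datatype.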

And yes, some tools are more picky about formats than others. Using the most basic version, and avoiding specialized versions that some sources provide, will avoid odd tool errors and/or scientifically invalid results.

From what you've stated, I don't think your current results are problematic - the usage sounds correct for featureCounts.

Thanks!

modified 6 months ago • written 6 months ago by Jennifer Hillman Jackson ♦ 25k

I did some testing, and any database can be assigned to both the GTF and the BAM: the tool will still produce valid results. But I still would not recommend doing that. Use a custom build assignment as you are now.
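To illustrate why the database label itself does not change the counts, here is a deliberately simplified Python sketch of overlap-based counting. It is not featureCounts' actual algorithm or code, and all names and coordinates are invented. It only shows that this kind of counting needs chromosome names and coordinates from the alignments and the annotation, and no genome sequence.

```python
from collections import Counter


def count_reads_per_gene(reads, features):
    """Toy read counting.

    reads:    list of (chrom, start, end), 1-based inclusive
    features: list of (chrom, start, end, gene_id), 1-based inclusive
    A read is counted only if it overlaps exactly one gene; reads that
    overlap no gene, or more than one gene, are skipped.
    """
    counts = Counter()
    for r_chrom, r_start, r_end in reads:
        hits = {
            gene
            for f_chrom, f_start, f_end, gene in features
            if f_chrom == r_chrom and r_start <= f_end and f_start <= r_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


features = [
    ("2L", 100, 200, "geneA"),
    ("2L", 150, 300, "geneB"),
    ("3R", 50, 90, "geneC"),
]
reads = [
    ("2L", 90, 120),    # overlaps geneA only
    ("2L", 160, 180),   # overlaps geneA and geneB: ambiguous, skipped
    ("2L", 250, 280),   # overlaps geneB only
    ("3R", 60, 70),     # overlaps geneC only
    ("chr3R", 60, 70),  # chromosome name not in the annotation: no hit
]
counts = count_reads_per_gene(reads, features)
```

The last read shows what does matter in practice: the chromosome names in the BAM and the GTF have to follow the same convention (for example "3R" versus "chr3R"), which is what a correct, matching database assignment is meant to vouch for.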

I've reviewed the tool some more and now think requiring the database assignment, even if from a custom build assignment, is best. If that is not enforced, the frequency of usage errors spikes considerably, which is why we added that database-match check between inputs in the first place. It requires end-users to confirm via the database assignment that the inputs are actually from the same build (as far as they know). Some mismatch-related errors still do come up, but far fewer than with the prior tool versions that did not require a database assignment on the inputs.
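The check described above can be pictured with a small Python sketch. This is a hypothetical illustration of the logic only, not the tool wrapper's real implementation. It assumes "?" is the unspecified default for the database attribute, and the function name is made up.

```python
UNSPECIFIED = "?"  # assumed placeholder for "no database assigned"


def inputs_share_build(bam_dbkey, gtf_dbkey):
    """True only if both inputs have a database assigned and the two match."""
    if UNSPECIFIED in (bam_dbkey, gtf_dbkey):
        return False
    return bam_dbkey == gtf_dbkey


checks = {
    "both dm6": inputs_share_build("dm6", "dm6"),
    "GTF unassigned": inputs_share_build("dm6", "?"),
    "different builds": inputs_share_build("dm6", "hg38"),
}
```

Under this sketch, a GTF whose database is still "?" fails the check even when the BAM is labeled, which would match the symptom in the original question.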

written 6 months ago by Jennifer Hillman Jackson ♦ 25k

Hey Jen, Thank you so much for your help with this, for your advice and explanation on gtf/gff file formats, and for performing those tests! I will definitely keep track of my database assignments and assign the appropriate sequences to any files to be analyzed. But, I do appreciate the insight into how featurecounts functions.

And, your assessment of requiring a database assignment to use this tool is very sound, I think. I’m sure having guardrails like these in place really helps prevent usage errors and increases the likelihood of reliable analyses. It certainly ended up helping and teaching me!

I will say that, as a new user, I don't think the featureCounts tool makes it clear that the GTF file must have a database assignment, and that this assignment must match that of the BAM file. That the BAM file must have a database assignment is made clear, but the text under the Gene Annotation File drop-down menu only says: “The program assumes that the provided annotation file is in GTF format. Make sure that the gene annotation file corresponds to the same reference genome as used for the alignment”

I know this is a very low priority change, but – could a request be put in to update this text slightly? In part using the text under the BAM drop-down, the addition could read something like:

“These files must have the database/genome attribute already specified e.g. hg38, not the default: ?, and this attribute must match the assignment of the loaded BAM file.”

Regardless, thanks again for your help Jen!

written 6 months ago by j.christopher.rounds • 20

This is a good idea. We've been discussing something very similar at Gitter (along with better help for the built-in annotation). The tool is going to be updated again -- and this could be added, too. You can make a proposed change yourself with a Pull Request against the tool repository here: https://github.com/galaxyproject/tools-iuc/tree/master/tools/featurecounts. Or, if you'd like the tool author to consider it first, click on Issues and make an enhancement request -- add @mblue to the ticket to get this routed faster to the right person. Thanks!

written 6 months ago by Jennifer Hillman Jackson ♦ 25k
