I'm new to StringTie and I've been following the "Finding and quantifying new transcripts" tutorial at https://galaxyproject.org/tutorials/nt_rnaseq/.
It is not clear to me why it is recommended to use a guide GFF file only when merging the StringTie data from all individual samples. What is the difference between doing this, always using the GFF file (both when running StringTie and StringTie merge), and using it only in the initial StringTie run of each sample?
I would appreciate it if someone could give me some insight into this. Thank you.
The original reference GTF contains the known transcripts/genes.
The results from StringTie include novel transcripts/genes (and presumably at least some known ones, if there are any for your genome) unless you restrict it to report only the knowns from a reference GTF. So there can be three kinds of results: only the knowns represented by your reads, those knowns + "guided" novel, or those knowns + unguided novel.
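As a rough command-line illustration of those three per-sample modes, here is a minimal Python sketch that only builds the argument lists and does not run anything. The helper name `stringtie_cmd` and the file names are hypothetical. The flags are the ones StringTie documents: `-G` supplies a reference annotation as a guide, `-e` limits output to the transcripts in that reference, and `-o` names the output GTF. The Galaxy tool form exposes the same choices under its own option labels.

```python
def stringtie_cmd(bam, out_gtf, ref_gtf=None, known_only=False):
    """Build the argument list for one per-sample StringTie run.

    ref_gtf=None                -> unguided assembly (no -G)
    ref_gtf set                 -> guided assembly, novel transcripts still reported
    ref_gtf set, known_only=True -> only reference transcripts reported (-e)
    """
    if known_only and ref_gtf is None:
        # -e only makes sense together with a reference annotation
        raise ValueError("known_only requires a reference GTF")
    cmd = ["stringtie", bam, "-o", out_gtf]
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]
    if known_only:
        cmd.append("-e")
    return cmd


# Hypothetical file names, for illustration only
unguided = stringtie_cmd("sample1.bam", "sample1.gtf")
guided = stringtie_cmd("sample1.bam", "sample1.gtf", ref_gtf="reference.gtf")
knowns = stringtie_cmd("sample1.bam", "sample1.gtf", ref_gtf="reference.gtf",
                       known_only=True)
print(" ".join(guided))
```

Any of these lists could be passed to `subprocess.run` if StringTie is installed locally.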
StringTie merge combines and reformats GTF data. Its output can have two kinds of content.
If given just the original reference GTF to fix up the formatting (often a required first step), the content does not change at all (original known).
If given StringTie output and the fixed-up reference GTF together, the content reflects discovery, if any, from your read data (original known + novel).
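The merge step that combines per-sample output with the reference can be sketched the same way. This is a minimal, non-authoritative example that assumes the command-line form `stringtie --merge`, where `-G` adds the reference annotation to the merge. The helper name `stringtie_merge_cmd` and all file names are hypothetical, and nothing is executed.

```python
def stringtie_merge_cmd(sample_gtfs, out_gtf, ref_gtf=None):
    """Build the argument list for a StringTie merge run.

    sample_gtfs: per-sample GTF files produced by earlier StringTie runs
    ref_gtf:     optional reference annotation, so knowns are merged in
    """
    cmd = ["stringtie", "--merge", "-o", out_gtf]
    if ref_gtf is not None:
        cmd += ["-G", ref_gtf]
    # Input GTFs are positional arguments after the options
    return cmd + list(sample_gtfs)


# Hypothetical file names, for illustration only
merge_cmd = stringtie_merge_cmd(
    ["sample1.gtf", "sample2.gtf"], "merged.gtf", ref_gtf="reference.gtf"
)
print(" ".join(merge_cmd))
```

Leaving out `ref_gtf` gives a merge built from the sample assemblies alone.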
If you do not care about novel isoforms/transcripts/genes (discovery), then do not include/create/merge/consider novel data in the analysis.
If you do care about novel data, be sure to allow it to be created and not filtered out. Knowns can be used as a guide or not, depending on whether you want them to influence how the data assemble.