Hi Brad,
I want to agree with you, and maybe there is a way to do this with sniffing... but I think blanket auto-assignment of ".fastqsanger" would be a mistake. I happen to see a great many datasets that are labeled as .fastqsanger but in fact are not. Public repositories in particular have older data versions associated with important studies.
For a variety of root-cause reasons, an incorrect assumption of datatype (without QA first; using FastQC is our recommendation) is probably one of the most common issues that comes up. Sadly, it often surfaces at the very end of an analysis, when poor mapping results were not noticed and downstream tools then produced wildly unexpected results or failed for lack of usable data. I also find it in analyses where some other issue was reported and I notice the datatype is off (I tend to check everything in shared histories for basic QA - full service!). So I see this both in histories where a datatype problem is suspected AND in ones where it is unsuspected.
Auto-assignment would mean quickly taking the first lines of an input .fastq file, detecting the quality scaling, then assigning the datatype. While these functions can go somewhat quickly, they will add time to the upload processing whether done as new functions or using current tools (these will be "jobs" - at least two more, and upload is also one - so three total minimum). That said, perhaps a smart way to do this could be created. It should be optional, to save time when the datatype is already known. It would also need instructions about what to do next (FASTQ Groomer) when the data is not .fastqsanger (e.g. .fastqillumina, or .fastqcs* of some type).
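To make the detection step a bit more concrete, here is a rough Python sketch of what "take the head, detect scaling, suggest a datatype" could look like. This is only an illustration, not Galaxy's actual upload or sniffing code: the function name guess_fastq_datatype and the "ambiguous"/"unknown" labels are made up for this example, the cutoffs are the commonly cited ASCII offsets (33 for Sanger, 64 for older Illumina, with Solexa reaching down to 59), and it assumes plain four-line records. Color space data (.fastqcs*) is not handled at all.

```python
import itertools


def guess_fastq_datatype(lines, max_records=1000):
    """Guess a FASTQ datatype label from the first records of a file.

    Hypothetical helper for illustration only. Assumes four-line FASTQ
    records (no wrapped sequence or quality lines).
    """
    lo, hi = 127, 0
    seen = 0
    # Only look at the head of the input, not the whole dataset.
    head = itertools.islice(lines, 4 * max_records)
    for i, line in enumerate(head):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        qual = line.rstrip("\r\n")
        if not qual:
            continue
        codes = [ord(c) for c in qual]
        lo = min(lo, min(codes))
        hi = max(hi, max(codes))
        seen += 1

    if seen == 0:
        return "unknown"
    if lo < 59:
        # Characters this low only occur with an offset of 33.
        return "fastqsanger"
    if lo < 64:
        # Below '@' but not below ';': consistent with Solexa scaling.
        return "fastqsolexa"
    if hi <= 74:
        # Could be high-quality Sanger or low-quality Illumina 1.3+.
        return "ambiguous"
    return "fastqillumina"
```

An open file handle can be passed directly, e.g. guess_fastq_datatype(open("reads.fastq")). Anything that comes back as "ambiguous" or "unknown" would still need a person to look at FastQC output before assigning or grooming, so a check like this narrows the question rather than replacing the QA step.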
But this is not so different from what we recommend now (the Support link/video I frequently send out about initial QA) - that end users do the quality scaling check, then groom/assign, before proceeding with QC and analysis. To help with proper FASTQ Groomer settings, or even to see whether grooming is needed, we recommend running FastQC first - and this is the tool (either it or just the portion that detects quality scaling) that I am talking about for the "jobs" associated with detecting type. Running on entire datasets takes longer and isn't needed on the first pass (only after quality scaling is adjusted/assigned as needed, and then for QC purposes - also an important step). To run on just a sample, the Text Manipulation tool "Select first lines" is the other "job" (or a tool/action similar to it).
I do agree completely that the question "what type of .fastq data do I have?" comes up often. And for new Galaxy users, knowing what to do first is not guided by a GUI or automated - although it is noted in just about every place we can squeeze it in. It is worth thinking about and perhaps could be improved upon. Workflows are not a great option here, as there are decisions to be made that depend on results, but they too may be part of a potential solution.
I'll let you and the developers consider the use cases and discuss. My guess is that the deciding factor will be run-time: could this be simplified, while retaining optional execution, in a way that does not greatly extend the execution time of Upload? If so, then why not auto-detect/assign .fastqsanger, .fastqillumina, etc. when the type is unknown? (You won't need it if you already know what your data is... if you are very certain, from a newer Illumina pipeline for example, just assign .fastqsanger at upload yourself and skip auto-detect.)
Excellent discussion point. Anyone who has an opinion should post!
Best, Jen, Galaxy team