Question: RNA Seq workflow with dataset collections
sfischer (United States) wrote, 9 months ago:

Hi,

I am trying to code a simple rna seq workflow that will use dataset collections. I want to support multiple paired-end samples, with multiple replicates.

I have scoured the web, but can't seem to find instructions or examples.

I found this video instructing me how to create a multi-replicate paired-end dataset collection, which is helpful: https://vimeo.com/163625221
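For anyone who would rather script this step than click through it: a multi-replicate paired-end collection is a "list of pairs" (collection type list:paired). Below is a minimal sketch of the nested description such a collection boils down to. The helper name, the dataset IDs, and the exact key names are assumptions for illustration, so check them against the API documentation for your Galaxy version before relying on them.

```python
from typing import Dict, List, Tuple


def build_list_paired(name: str, samples: Dict[str, Tuple[str, str]]) -> dict:
    """Describe a list:paired collection: one 'paired' element per replicate.

    `samples` maps a replicate name to (forward_dataset_id, reverse_dataset_id).
    The key names used below are an assumption about the payload format.
    """
    elements: List[dict] = []
    for sample_name, (fwd_id, rev_id) in samples.items():
        elements.append({
            "name": sample_name,
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": fwd_id},
                {"name": "reverse", "src": "hda", "id": rev_id},
            ],
        })
    return {
        "name": name,
        "collection_type": "list:paired",
        "element_identifiers": elements,
    }


# Hypothetical dataset IDs for two replicates of one condition.
description = build_list_paired(
    "condition A",
    {"rep1": ("id_r1_fwd", "id_r1_rev"), "rep2": ("id_r2_fwd", "id_r2_rev")},
)
```

The outer list carries the replicate names, and each inner pair carries the forward and reverse reads, which is why a single connection in the editor can represent every replicate at once.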

But I haven't found any examples of writing workflows that consume such a dataset collection. The closest I found, which isn't all that helpful, was Figure 15 here: https://galaxyproject.org/tutorials/collections/

My flow is: trimmomatic -> hisat2 -> htseq-count -> DESeq2

I don't understand how to wire it, such that all the replicates work.
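As a rough mental model of the wiring (not Galaxy code): when a tool that expects one paired input is given a list of pairs, it runs once per replicate and its outputs are gathered into a new list with the same element names. The sketch below imitates that with placeholder functions that only build labels; none of them call the real tools, and `map_over` is a hypothetical helper.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[str, str]  # (forward reads, reverse reads)


def map_over(tool: Callable, collection: Dict[str, object]) -> Dict[str, object]:
    """Run `tool` once per element, keeping the element identifiers."""
    return {identifier: tool(element) for identifier, element in collection.items()}


# Placeholder steps: they only build labels, they do not run the real tools.
def trimmomatic(pair: Pair) -> Pair:
    return (f"trimmed({pair[0]})", f"trimmed({pair[1]})")


def hisat2(pair: Pair) -> str:
    return f"bam({pair[0]}, {pair[1]})"


def htseq_count(bam: str) -> str:
    return f"counts({bam})"


replicates = {
    "rep1": ("rep1_R1.fastq", "rep1_R2.fastq"),
    "rep2": ("rep2_R1.fastq", "rep2_R2.fastq"),
}

trimmed = map_over(trimmomatic, replicates)  # list of pairs -> list of pairs
aligned = map_over(hisat2, trimmed)          # list of pairs -> list of BAMs
counts = map_over(htseq_count, aligned)      # list of BAMs -> list of count files

for name, result in counts.items():
    print(name, result)
```

The last step of the flow behaves differently: DESeq2 takes the whole list of count files for each condition as one input rather than running once per replicate.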

There are probably examples someplace, but I don't know where to look.

Thanks, steve

Jennifer Hillman Jackson (United States) wrote, 9 months ago:

Hello,

Please see this tutorial for an example RNA-seq workflow that uses collections. Htseq_count can be used instead of FeatureCounts. https://galaxyproject.org/tutorials/nt_rnaseq/

After going through the tutorial, extract the workflow from the history to better understand how the datasets connect between tools. The Galaxy 101 tutorial covers how-to: https://galaxyproject.org/tutorials/g101/#creating-and-editing-a-workflow

Thanks! Jen, Galaxy team


Jen, Thanks!

That tutorial was helpful. I made progress, but am stuck now.

Here is my workflow. Here is my history.

The problem: HISAT2 did not run. It never seemed to queue, start, finish or error. Just radio silence.

Why would that be?

Thanks, Steve

Reply written 9 months ago by sfischer

Hi Steve - Strange, but glad you reported this. We've seen workflow invocation issues recently, but they should be cleared up. The pre-release on Main may have a bug. I'll review/do some testing and get back to you. Thanks! Jen

Reply written 9 months ago (modified 9 months ago) by Jennifer Hillman Jackson

Jen,

Something weird is going on in the workflow editor.

I reloaded my workflow. HISAT2 is no longer attached to Trimmomatic, although it was attached when I saved it this morning. (If it spontaneously became detached without my knowing, that would explain why HISAT2 did not run!)

Luckily, this morning I had captured a screenshot of the way it was when I saved it: https://preview.ibb.co/gNyODH/workflow1.png

I can no longer attach HISAT2 to Trimmomatic. I’ve tried all combinations of those two tools taking in either individual files or collections.

I am a newbie, but, something seems fishy.

Steve

Reply written 9 months ago by sfischer

Please ignore my post above about not being able to connect trimmomatic and hisat. That was user error!

The original problem I reported, that HISAT2 does not queue, fail, or run, is still not solved. The workflow and history are still there for you to investigate.

Can you please take a look?

Thanks! Steve

Reply written 9 months ago by sfischer

Hi Steve -

There are a few issues. The most important is that you have encountered a known issue with workflows (tool version conflicts in workflows: https://github.com/galaxyproject/galaxy/issues/4041). Trimmomatic is impacted and possibly HISAT2. The workaround is to disconnect all of the inputs/outputs for the tool (or all tools, or the first tool that doesn't get launched in test runs), then reconnect the input, next connect the output as the input to the next downstream tool, then that tool's output, and so on through any tools that do not launch. The ordering of connections is important when using this workaround: always connect the input first and the output second for any tool impacted.
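To make the ordering concrete, here is a small sketch for a strictly linear workflow. It only prints the order in which to redraw the connections in the editor; it does not talk to Galaxy, and the step names and the helper `reconnection_order` are illustrative assumptions.

```python
from typing import List, Tuple


def reconnection_order(steps: List[str]) -> List[Tuple[str, str]]:
    """List the connections of a linear workflow in the order to redraw them.

    Each tuple is (upstream step, downstream step). Walking left to right
    means every tool has its input connected before its output.
    """
    return [(steps[i], steps[i + 1]) for i in range(len(steps) - 1)]


# Hypothetical step names for a trim -> align -> count workflow.
pipeline = ["input collection", "Trimmomatic", "HISAT2", "htseq-count"]

for upstream, downstream in reconnection_order(pipeline):
    print(f"connect {upstream} -> {downstream}")
```

Disconnect everything first, then draw the connections in the printed order, so that Trimmomatic receives its input before its output is attached to HISAT2, and so on downstream.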

Other items:

No target reference genome was selected for HISAT2. When no target genome is chosen, the tool defaults to the first built-in index. For this tutorial's data, set the genome to be mm10. This is a known usage issue that we will be addressing soon (require the selection of a target genome on tool forms: https://github.com/galaxyproject/galaxy/issues/4499).

HISAT2 needs to have the option for alignment reporting set to be tailored for StringTie. https://galaxyproject.org/tutorials/nt_rnaseq/#spliced-mapping-with-hisat

The HISAT2 output is a collection. The input type for StringTie needs to be set as a paired collection and then the first output collection from HISAT2 is input to StringTie. The data is interpreted as paired-end even though there is only one collection noodle in the workflow editor.

The primary GTF output from StringTie should be used as the reference annotation for Htseq-count (instead of the coverage output). For many use cases, the tool StringTie Merge needs to be used between these two tools to format the file properly. This can be run on a single GTF input (no mm10 reference annotation is required, although this is where you could add it in).

More steps are needed to combine all of the data for both input fastq collections, but this should get you some results as you build up the workflow. I would also strongly recommend working through an entire tutorial, or at least your own analysis path, in the history first (tool by tool, testing to see if the inputs/outputs are valid and producing the desired results) and then extracting a workflow from the history that you know already works. It can be a bit complicated to build up a workflow from scratch when first learning how to properly set tool options and how to use the editor. Once you have a functioning workflow, you can edit it further, saving back copies in between changes, to construct the final version. This makes it much easier to know which changes are impacting any problems that may emerge and need tuning.
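One way to spot a silently disconnected step before launching is to download the workflow and inspect it. The sketch below assumes the exported file is JSON with a "steps" mapping whose entries carry "type", "name", and "input_connections"; that layout, the helper name, and the connection key "paired_input" are assumptions to verify against a real export from your server.

```python
import json
from typing import List


def unconnected_tool_steps(workflow: dict) -> List[str]:
    """Return the names of tool steps that have no incoming connections."""
    missing = []
    for step in workflow.get("steps", {}).values():
        if step.get("type") == "tool" and not step.get("input_connections"):
            missing.append(step.get("name", "unnamed step"))
    return missing


# Tiny hand-written stand-in for an exported workflow; not a real export.
example = json.loads("""
{
  "steps": {
    "0": {"type": "data_collection_input", "name": "Input dataset collection",
          "input_connections": {}},
    "1": {"type": "tool", "name": "Trimmomatic",
          "input_connections": {"paired_input": {"id": 0, "output_name": "output"}}},
    "2": {"type": "tool", "name": "HISAT2", "input_connections": {}}
  }
}
""")

print(unconnected_tool_steps(example))
```

A tool step that shows up in this list has nothing feeding it, which would be worth checking whenever a step never queues. Note that tools which legitimately take no dataset inputs would also be listed, so treat the output as a hint rather than a verdict.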

Reference tutorials:

Hope that helps! Jen

Reply written 9 months ago (modified 9 months ago) by Jennifer Hillman Jackson

Jen,

thanks much for the very thorough response.

I completely re-wired the workflow, from left to right. All the connections are multi-noodles, as expected.

As before, the HISAT2 parameter "Select a Reference Genome" is 'Set at runtime'.

I saved the workflow. https://usegalaxy.org/u/steve-fischer/w/trim-align-count-one-condition-paired

I re-ran it, choosing mus10 at run time.

Still... HISAT does not run. https://usegalaxy.org/u/steve-fischer/h/rna-seq-test-4---results-2

I'm baffled.

Thanks, Steve

Reply written 9 months ago by sfischer

I'd like to add that, as a user, I feel like galaxy should be telling me what is wrong... unless this is just a bug.

Best, Steve

Reply written 9 months ago (modified 9 months ago) by sfischer

I'll test the runtime genome select option -- set for mm10 (what you probably meant to report).

There is the one known bug, which reconnecting the noodles should solve (and which we are actively addressing with priority now). More could be going on - the server has the 18.01 pre-release loaded. Most issues are resolved, some are in progress, and all critical problems will be corrected before the release is finalized.

More feedback soon.

Reply written 9 months ago (modified 9 months ago) by Jennifer Hillman Jackson

I fixed the workflow for all of the items I listed out before. The usage issues were not fixed in the workflow you shared back. All jobs now launch correctly. This does not mean that your modified methods produce scientifically valid results (as it skips some of the intermediate tools included in the tutorial - you really must review what those are doing, or test/review yourself any results from modifications). But the technical part of this workflow is functioning. Galaxy can report errors for many problematic use-cases but not all. This is why ensuring a tool/analysis path works AND produces the results you want on a sample of data, then extracting a workflow, is such a useful method.

I created two versions. One has all primary outputs left hidden/unhidden as you originally set them; the other has all primary outputs unhidden. Intermediate or sub-datasets from collections will always be hidden by default. Click on the top of the history panel to toggle the view for all hidden datasets.

Fixed workflow with the hide/unhide results left as you set them originally: https://usegalaxy.org/u/jen/w/copy-of-imported-trim-align-count-one-condition-paired

Resulting history: https://usegalaxy.org/u/jen/h/imported-fastq-coll-inputs-only-rna-seq-test-2----httpsbiostarusegalaxyorgp2681926837-workflow-launch-issues-2-22

Fixed workflow with all results unhidden (good idea to leave all unhidden until a workflow is in a working state - or you can set the history to display hidden datasets to watch how the jobs queue, prepare, execute in order): https://usegalaxy.org/u/jen/w/copy-of-copy-of-imported-trim-align-count-one-condition-paired----with-fixed-tool-params-noodleds-plus-all-unhidden

Resulting history: https://usegalaxy.org/u/jen/h/imported-fastq-coll-inputs-only-rna-seq-test-2----httpsbiostarusegalaxyorgp2681926837-workflow-launch-issues-2-22-using-workflow-with-no-hidden-outputs

Reply written 9 months ago (modified 9 months ago) by Jennifer Hillman Jackson