Workflow Improvement Requests (Long)

10.0 years ago by

United States

Assaf Gordon • 320 wrote:

Dear all,

Recently, users of our local Galaxy server started using workflows, and they are very pleased. However, as workflows get more complicated, it gets harder to track their inputs and outputs. I'd like to share an example to illustrate the problems that we encounter.

The workflow (pictured in the attached 'workflow.jpg') takes 4 input datasets and produces 4 output datasets.

The first problem is that there's no way to differentiate between the input datasets (they appear simply as "Step 1: Input dataset", "Step 2: Input dataset", etc.). Since each dataset has a specific role, I've had to print the workflow and give the users instructions as to which dataset in their history goes into which input (see attached 'crosstab_workflow_input_datasets.jpg').

The second problem is that whenever I change something in the workflow and save it, the order of the datasets changes! So what was once dataset 1 can now be dataset 2, 3 or 4. Users have no way of knowing this. (Keen users might notice that the description of the first tool changed from "Output dataset 'output' from step 2" to "Output dataset 'output' from step 4", but this is very obscure.)

The third problem is that once the workflow completes, the resulting datasets have cryptic names such as "Join two queries on Data 10 and Data 2". Since "Data 10" is "Awk on Data 8", Data 8 is "Generic Annotations on Data 7 and Data 1", and Data 7 is "Intersect data 1 and data 6", it gets a bit hard to know what's going on (see attached 'crosstab_history.png'). For the time being, I've simply given written instructions on what each dataset means (see attached 'crosstab_workflow_dataset_explnanations.jpg').

If I may suggest a feature: it would be great if I could name a dataset inside the workflow. Instead of naming it "Input dataset" I could give it a descriptive name, so even if the order of the input datasets changes, users will know which dataset goes into which input.

Regarding the output dataset names, the 'label' option in the tools' XML is a good start, but it still creates very long, hard-to-understand names. Another great feature would be the possibility to add an 'output label' for each step in the workflow.

Regardless of the above, I'd like to say (once again) that Galaxy is a great tool, and workflows are really cool - we have several long workflows which do wonderful things.

Thanks for reading so far,
Gordon.

galaxy • 1.2k views

modified 10.0 years ago by James Taylor • 70 • written 10.0 years ago by Assaf Gordon • 320

10.0 years ago by

Gunnar Raetsch • 60 wrote:

Dear Assaf and everybody else,

I can only reinforce what you said: Great work! ... and that I had similar problems. In particular, when working with workflows that have, say, 50 different steps, things can become very confusing. It would help if one could define the outputs of the workflow and hide all the steps in the history that are inside the workflow and not related to inputs and outputs.

Another feature that I would find very helpful in designing larger workflows would be the ability to use workflows within a larger workflow. In my case I have a set of tasks that have to be repeated using several different settings within a larger workflow.

I realize that workflows are still in beta and that it might be too early to ask for such features... but it would be great to see them in beta soon.

Thanks a lot for your efforts!
Gunnar

+-------------------------------------------------------------------+
Gunnar Rätsch                    http://www.fml.mpg.de/raetsch
Friedrich Miescher Laboratory    Gunnar.Raetsch@tuebingen.mpg.de
Max Planck Society               Tel: (+49) 7071 601 820
Spemannstraße 39, 72076 Tübingen, Germany    Fax: (+49) 7071 601 801

10.0 years ago by

James Casbon • 370 wrote:

Hi Everyone,

Slightly off-topic, but I see you have awk in your workflows. Awk could work on text, tabular, and other formats, but I'd rather not define a new tool for each input type. Is there a way to define a tool which accepts any type of input? It should ideally preserve the format in the output as well.

thanks,
James

2008/11/14 Assaf Gordon <gordon@cshl.edu>:

James,

The datatypes are a hierarchy, and tools will accept any type that is more specific than their defined input type. If you set the input type to "data", the tool will accept anything; if you set it to "text", it will accept any text format.

For outputs, there is a special format "input" which copies the type of the input dataset (the first input, I believe; this needs to be enhanced to allow specifying a particular input). There is also the "metadata_source" attribute for copying the input metadata. This is how many of our tools that work on tabular data preserve the type and metadata of "interval" format files.

-- jt
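
For readers who find the hierarchy rule easier to follow in code, here is a minimal Python sketch of the idea. It is not Galaxy's implementation: the class names, the `accepts` function, and the `output_type` function are all hypothetical, chosen only to mirror the rule that a tool accepts any type at least as specific as its declared input type, and that an output declared as "input" copies the type of the first input.

```python
# Hypothetical stand-ins for a datatype hierarchy (NOT Galaxy's real classes).
class Data:
    ext = "data"

class Text(Data):
    ext = "txt"

class Tabular(Text):
    ext = "tabular"

class Interval(Tabular):
    ext = "interval"

class Binary(Data):
    ext = "binary"


def accepts(declared, actual):
    """True if a tool input declared as `declared` would take a dataset of
    type `actual`, i.e. `actual` is the same type or a more specific one."""
    return issubclass(actual, declared)


def output_type(declared, inputs):
    """Mimic the special output format "input": copy the type of the first
    input dataset. Any other value is treated as an explicit type."""
    if declared == "input":
        return inputs[0]
    return declared


# A tool declared on Data takes anything; one declared on Text takes only text types.
print(accepts(Data, Binary))      # anything is a Data
print(accepts(Text, Interval))    # Interval is a kind of Text
print(accepts(Text, Binary))      # Binary is not a kind of Text
print(output_type("input", [Interval, Text]).ext)
```

Modelling the hierarchy as subclasses makes the "more specific than" test a plain `issubclass` check, which is why a tool declared on the most general type accepts every dataset.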

written 10.0 years ago by James Taylor • 70

Great, thanks a lot. You're way ahead of me here ;)

2008/11/25 James Taylor <james.taylor@emory.edu>:

written 10.0 years ago by James Casbon • 370

Indeed, the 'awk' tool accepts 'format="txt"' and therefore can handle almost any file in Galaxy.

Regarding your other question ('user parameters ending up on command line'), here's my suggestion:

In the <command> section, enclose the parameter in single quotes (make sure they are single and not double):

    <command interpreter="sh">awk_wrapper.sh $input $output '$file_data'</command>

In the program parameter (where users can enter whatever they want), add a validator to prevent single quotes:

    <param name="file_data" type="text" area="true" size="5x35" label="AWK Program" help="">
        <validator type="expression" message="Invalid Program!">value.find('\'')==-1</validator>
    </param>

This way the parameter the user enters will always be single-quoted and not parsed by the shell.

-Gordon.

James Casbon wrote, On 11/25/2008 12:31 PM:
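
To illustrate why the two pieces work together, here is a small Python sketch. It is only an illustration under stated assumptions: `is_valid_program` and `build_command` are hypothetical helpers (not part of Galaxy), and `shlex.split` is used merely to model POSIX shell tokenization; nothing is actually executed.

```python
import shlex


def is_valid_program(value):
    # Same check as the validator expression in the post:
    # the value is valid only if it contains no single quote.
    return value.find("'") == -1


def build_command(input_path, output_path, program):
    # Hypothetical helper mirroring the <command> template:
    # the user-supplied program is wrapped in single quotes.
    if not is_valid_program(program):
        raise ValueError("Invalid Program!")
    return "awk_wrapper.sh %s %s '%s'" % (input_path, output_path, program)


# A program containing characters the shell would normally interpret.
program = '{ print $1; } ; rm -rf "$HOME"'
cmd = build_command("in.txt", "out.txt", program)

# Tokenize the way a POSIX shell would: the whole program stays one argument.
args = shlex.split(cmd)
print(len(args))
print(args[3] == program)
```

Inside single quotes a POSIX shell treats every character literally, so the only way to break out is another single quote. The validator rejects exactly that character, which is why the quoting cannot be escaped.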

written 10.0 years ago by Assaf Gordon • 320

10.0 years ago by

Eric Schauberger • 10 wrote:

I second the request for some type of labeling system for the workflow, or at least a numbering system. I made a workflow with many inputs, and when I tried it out I realized that the first input that I was joining with the second input was intermixed and unidentifiable. Then I realized that the inputs are ordered by their creation order and not by any type of order in how they are placed. Since I was making many, many inputs, I simply made a bunch of them at once and didn't keep track of their order or where I put them.

Thanks again for the sweet tool.

Eric

--
________________________________________________________
Eric M Schauberger
Physician Scientist Training Program (DO/PhD)
Genetics Program
Ewart Lab
MSU College of Osteopathic Medicine (MSUCOM)
Email: Schaube2@msu.edu
Skype: Emschaub
See my availability: http://www.timebridge.com/mytime/eschauberger
__________________________________________________________

10.0 years ago by

James Taylor • 70 wrote:

Okay, input dataset labels are (finally ;) implemented in 15bf910890d5 which is r1647 in central and 1907 in the security development branch. In the editor form, you can provide a name for any input dataset, and it will be displayed as the row label in the run form. The support for this is pretty generic, and this changeset may help anyone wanting to try adding more parameters to a workflow module (like constraining input dataset type). -- jt
