Is it possible for workflow parameters to take information from the names of input datasets? e.g. ${Read_group} is set to the first three characters of the input dataset name.


3.7 years ago by

Guy Reeves • 1.0k

Germany

Guy Reeves • 1.0k wrote:

Hi. I would like to run a workflow in parallel on a list of multiple datasets (using the dataset list options). Within the workflow I need to map using BWA and would like to add the appropriate ‘readgroup’, ‘library’, and ‘sample’ parameters for each of the input datasets. All the information I need is included in the input dataset name, e.g. ‘sample’ is the first 3 characters of the dataset name, ‘library’ is characters 4-6 of the name…
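To make the naming scheme concrete, here is a minimal Python sketch of the position-based extraction described above. It is only an illustration of the idea, not an existing Galaxy feature; the function name `fields_from_name` and the example dataset name are hypothetical.

```python
def fields_from_name(dataset_name):
    """Slice the sample and library codes out of a dataset name by position."""
    if len(dataset_name) < 6:
        raise ValueError(f"name too short to hold sample and library: {dataset_name!r}")
    return {
        "sample": dataset_name[0:3],   # characters 1-3 of the name
        "library": dataset_name[3:6],  # characters 4-6 of the name
    }


# Hypothetical dataset name: sample "S01", library "L07"
print(fields_from_name("S01L07_L002_R1.fastq.gz"))
```

Fixed positions are simple, but they only work if every dataset name follows exactly the same layout.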

I have found variants of my question from several years ago, which indicated that it is only possible to use input dataset names for renaming output files using #{-inputFile-}.

Is it still the case that ${ } parameters set by the user at the beginning of workflows cannot use information in dataset names?

Thanks Guy

$ bwa parameter galaxy • 1.5k views


modified 3.7 years ago • written 3.7 years ago by Guy Reeves • 1.0k

3.7 years ago by

Dannon Baker ♦ 3.7k

United States

Dannon Baker ♦ 3.7k wrote:

Yes, this is still the case. If you'd like, feel free to create a Trello card detailing your request at http://galaxyproject.org/trello/

written 3.7 years ago by Dannon Baker ♦ 3.7k

I have no idea if this Trello card has attracted any attention (I do think I have the privileges to see it). But I wanted to ask whether it is possible to get an idea of how feasible it would be to modify a wrapper for ‘BWA for Illumina’ so that the -r option took information (‘readgroup’, ‘library’, and ‘sample’) from the input dataset name. I have limited Python, but if anybody thought this was a small project I might try to have a go, unless somebody else could have a go (or has already implemented it).
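As a rough idea of the string such a wrapper would need to produce, here is a small Python sketch that formats a read group header line in the style shown in the bwa manual (‘@RG\tID:foo\tSM:bar’), with the tab written as a literal backslash-t. The function name `read_group_header` and the example values are hypothetical, and this is not code from any existing Galaxy wrapper.

```python
def read_group_header(rg_id, sample, library):
    """Format an @RG header line with ID, SM and LB fields."""
    fields = [("ID", rg_id), ("SM", sample), ("LB", library)]
    for tag, value in fields:
        # Empty values or embedded whitespace would produce a malformed header
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid value for {tag}: {value!r}")
    # "\\t" is a literal backslash followed by "t", as in the bwa manual example
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


# Hypothetical values taken from a dataset name
print(read_group_header("L002", "S01", "L07"))
```

The resulting string is what would be passed as the argument to the read group option; how the wrapper obtains the dataset name in the first place is the part that would need changes in Galaxy.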

Thanks Guy

written 3.7 years ago by Guy Reeves • 1.0k

3.7 years ago by

Guy Reeves • 1.0k

Germany

Guy Reeves • 1.0k wrote:

Thanks, I have just created a Trello card.

https://trello.com/c/hFfuIHE6/2572-idea-extending-parameters-for-ngs-mappers-to-take-information-from-input-dataset-names

If anybody agrees with this suggestion, please do vote for it.

Thanks Guy

written 3.7 years ago by Guy Reeves • 1.0k

I thought I would cross-post what I wrote on the Trello site in case anybody is interested (I guess you can upvote this here if you are not registered on Trello: click the thumbs-up icon).

IDEA: Extending ${} parameters for NGS mappers to take information from input dataset names.

The excellent data collection / list options already implemented in Galaxy allow users to easily set up simultaneous instances of workflows (that generate results in separate histories). However, in any workflow that includes an NGS mapping tool it is not possible to do this and still specify necessary variable parameters like readgroup (ID or @RG), library (LB) and sample (SM), despite the fact that all this information is in the fastq.gz dataset name.

It seems to me that all the key functionality is already implemented in Galaxy to permit easy parallel runs of workflows that include mapping tools, except the capacity to take library and sample information from fastq.gz dataset names.

If the current workflow ${} parameter option could be extended to pass information from input fastq.gz dataset names into NGS mapping tools, this would, I think, remove the last major limitation of Galaxy relative to the command line and ensure that users are able to deal with the increasing use of multi-lane NGS sequencing machines and the multiplexing of ever larger numbers of samples within the same sequencing cell.

If any of this is not clear or if I have missed an obvious way to implement this please do tell me.
I guess that this capacity has been considered in the past and maybe there are good reasons not to do this, but I think that as the number of lanes on NGS sequencing machines and the degree of sample multiplexing increase, it would be a really useful addition to all the great existing functionality.
Thanks

Guy

A slightly more detailed explanation of the above.
With the increasing use of next-generation sequencing machines with multiple lanes (4-16 lanes, resulting in a corresponding number of readgroups for each sample), and with larger and larger numbers of samples being multiplexed in a single sequencing run, it is increasingly impractical to manually set the key parameters (RG, LB, SM) of the most common step in NGS workflows, mapping. This holds regardless of whether it is done at the time of mapping or afterwards using the ‘Add or Replace Read Groups’ tool.
For example, I use an Illumina machine where each library will have fastq.gz files from 4 lanes. Furthermore, even on the mid-output sequencing cells it is possible to multiplex >100 libraries within the same run (in my case Drosophila genomes). This results in >400 readgroup-by-library combinations needing to be specified for every sequencing run (I use the ‘BWA for Illumina’ tool).

At least on Illumina machines it is possible for users to specify part of the output file names (through the sample sheet used by the sequencing machine), meaning that all key mapping parameters could easily be placed in the fastq.gz file names (possibly separated with special characters). Readgroup (RG) can be taken from the lane number L00{1-4}, which is already part of the standard output file name.
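As an illustration of how such file names could be taken apart, here is a Python sketch that uses a regular expression. It assumes a hypothetical convention in which the user-chosen prefix is "sample-library" (joined by a hyphen) and the lane appears as an `L` followed by three digits between underscores; the pattern, the function name `parse_fastq_name` and the example file names are my own assumptions, not anything Galaxy or the sequencer defines.

```python
import re

# Hypothetical convention: <sample>-<library>_[other tokens_]L<3 digits>_...
NAME_PATTERN = re.compile(
    r"^(?P<sample>[^-_]+)-(?P<library>[^-_]+)_"  # user-chosen prefix from the sample sheet
    r"(?:.*_)?(?P<lane>L\d{3})_"                 # lane token, e.g. L002
)


def parse_fastq_name(file_name):
    """Return sample, library and readgroup parsed from a file name, or None."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None  # name does not follow the assumed convention
    return {
        "sample": match.group("sample"),
        "library": match.group("library"),
        "readgroup": match.group("lane"),  # lane number used as the readgroup
    }


# Hypothetical file name following the assumed convention
print(parse_fastq_name("dm3-libA_S5_L002_R1_001.fastq.gz"))
```

Using separators rather than fixed character positions means the sample and library codes can vary in length, at the cost of reserving the hyphen and underscore as special characters.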

If the current workflow ${} parameter input were extended to NGS mapping tools, this would, I think, remove the last major limitation of Galaxy relative to the command line and ensure that users are able to deal with the increasing use of multi-lane sequencing machines and the multiplexing of ever larger numbers of samples within the same sequencing cell.

written 3.7 years ago by Guy Reeves • 1.0k
