Multiple Tool Output Help!


4.2 years ago by

Marija • 20

Canada

Marija • 20 wrote:

I'm trying to get a tool working on Galaxy.

The tool outputs two files, and I need them to be associated with each other. Is the composite datatypes approach the only way to do this?

If so, how exactly do composite datatypes work?

Thanks for the help!



modified 4.2 years ago by fubar ♦ 1.1k • written 4.2 years ago by Marija • 20

4.2 years ago by

fubar ♦ 1.1k

Australia

fubar ♦ 1.1k wrote:

Glad to hear you are developing tools.

Briefly, the answer depends on details you haven't given, such as what you mean by "associated" and whether the related file formats are identical (e.g. paired-end reads). Either way, you are probably heading into advanced tool wrangling. Once you have specific, well-posed questions, you're much more likely to get specific answers here. In the meantime, here are some pointers:

Composite objects are one way to associate two or more datasets into a single history object. A text search of the wiki will find pages like https://wiki.galaxyproject.org/Admin/Datatypes/Composite%20Datatypes?highlight=%28composite%29 which may go some way towards answering your questions. Tool source code and some experimenting are your best friends here because the documentation is far from ideal; for many programmers, the best way to understand composite datatypes is to read the code of tools that use them, e.g. the fastqc tool.

If the files are in the same format, there's a newer approach called dataset collections for (e.g.) pairing paired-end read files or creating groups of files that need to stay together. These are relatively new and still under development, and AFAIK they can only be created through the bioblend API, which you can read about at http://bioblend.readthedocs.org/en/latest/api_docs/galaxy/all.html?highlight=collections

written 4.2 years ago by fubar ♦ 1.1k

Thanks for your response! Yeah, I didn't know how much detail to post because I didn't want to make a wall of text, but...

By "associated" I mean that the tool takes one file as input, but the other file has to be in the same directory as the first for the tool to work. It's kinda like having a .bam file and the corresponding .bai index file: you only load the .bam file into the tool and the tool knows to use the .bai file as well, but the two have to be in the same directory and have the same name before the extension (ex1.bam and ex1.bai). Think of how IGV takes bam or sam as the input format but needs an index too. The related file formats are not identical (just as .bam and .bai aren't identical).

I'll use ex1.bam and ex1.bai to help illustrate my problem:

My problem isn't with the tool that needs to use ex1.bam (and know where ex1.bai is and use it); my problem is with the tool that makes those files (I'm trying to make both work).

If I tell Galaxy to output only one file, it writes the ex1.bam file to the directory that holds outputs (/galaxy-dist/database/files/000/), which is good, but then the ex1.bai file is lost (because Galaxy uses temporary directories to store the files needed to run the tool and deletes them once it's done), so that doesn't work.

If I tell Galaxy to output two files, it writes both the ex1.bam file and the ex1.bai file to the output directory, but Galaxy does its own thing with naming them (so the files would be dataset_01.dat and dataset_02.dat). That doesn't work, because the files are just numbered and it becomes impossible to tell which one is the ex1.bai that goes with ex1.bam. The tool doesn't like it either: given the way Galaxy names output files, it would need a file called dataset_01.bai to recognise it as the index for dataset_01.dat. I went in and manually renamed the file to dataset_01.bai (from the command line, after running the file-producing tool from Galaxy), then ran the second tool (the one that uses the associated files), and it worked. But when I added an extra command in extool.xml (for the tool that makes the files) to do the rename, the result was dataset_01.dat.bai rather than dataset_01.bai, so that doesn't work either. Honestly, I would prefer to stay away from renaming files anyway, because I don't want to mess with Galaxy's own structure and naming conventions.
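To make the naming problem concrete, here is a small Python sketch of the difference between appending an extension (which gives dataset_01.dat.bai) and swapping it (which gives dataset_01.bai). The function names are hypothetical and this only manipulates path strings; it does not touch Galaxy's files and is not a recommendation to rename them.

```python
import os


def append_extension(path, new_ext):
    """Naive rename target: tack the new extension onto the full name."""
    return path + new_ext


def swap_extension(path, new_ext):
    """Replace the final extension (e.g. '.dat') with new_ext (e.g. '.bai')."""
    root, _old_ext = os.path.splitext(path)
    return root + new_ext


galaxy_output = "/galaxy-dist/database/files/000/dataset_01.dat"
print(append_extension(galaxy_output, ".bai"))  # name ends in .dat.bai
print(swap_extension(galaxy_output, ".bai"))    # name ends in .bai
```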

Basically, from what I have been able to figure out from googling and the Galaxy wiki, a composite datatype is kinda like outputting a directory with those two files inside it? Honestly, though, I'm having trouble grasping it. But if that's what it is, it would be useful, because then I could put the files in the current working directory (the temporary directory Galaxy makes for each job, i.e. each time it runs a tool), rename them, and everything would be happy.

If you have any suggestions for how to handle the two files in Galaxy, or any other information on composite datatypes that would help me understand them better (if they would be useful for this), that would be really great :)

Also, would you suggest reposting my question using this explanation?

Thanks again, and sorry for the huge response.

modified 4.2 years ago • written 4.2 years ago by Marija • 20

4.2 years ago by

fubar ♦ 1.1k

Australia

fubar ♦ 1.1k wrote:

Walls of text are fine, and here's one in response! In a sense, you asked the wrong question, because bam/bai are handled internally and transparently to the user. You don't need multiple outputs, but you do need to know something about how the bam datatype object transparently indexes itself using a converter as part of setting metadata.

Suggestion: you might save yourself a lot of time and effort by looking at existing tools that seem to do the kinds of things you want and cloning that code. You do not need or want composite files to manage (eg) bam index files because the bam object knows how to index itself. Unfortunately for programmers new to Galaxy internals but fortunately for users, the relationship between bam and bai is managed under the hood.

For example, there are existing tools (sam to bam comes to mind) that write bam files, and Galaxy routinely takes care of indexing them for you as part of setting the new dataset's metadata. The same goes for uploading a bam: there's no point in uploading the bai, because new bams are autoindexed when they appear in a Galaxy history. I'm not sure if you can stop that. It's designed to ensure that the bai matches Galaxy's reference genomes, which is kind of why the whole process is hidden from users and thus mysterious and not easily finagled by programmers, as you're learning :)

The Galaxy-generated bai file is just another file as far as Galaxy is concerned, but its path is hidden: it is stored as part of the bam file's metadata. It can be recovered when it needs to be passed to a tool, but it cannot easily be found otherwise. Fubar's htseq tool illustrates how to locate and pass the bai file for a bam. For example, if a user has selected $bamf, then passing $bamf.metadata.bam_index to your tool will allow it to access the index. Be warned, though: MOST tools require the Galaxy path (which ends in .dat) to be, ahem, adjusted so that the tool sees something ending in .bai - but that's another story!
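As a rough illustration of that "adjusting": one possible workaround (an assumption here, not necessarily what the htseq tool does) is to symlink the bam and its index into the job's working directory under matching names before calling the tool. The sketch below is plain Python; stage_bam_with_index, the "input" stem and the default ".bam.bai" suffix are all hypothetical choices, and some tools expect "input.bai" instead, so the suffix is a parameter.

```python
import os


def stage_bam_with_index(bam_path, index_path, workdir,
                         stem="input", index_suffix=".bam.bai"):
    """Symlink a bam and its index into workdir under matching names.

    bam_path and index_path stand in for the Galaxy-side .dat paths
    (e.g. the values of $bamf and $bamf.metadata.bam_index).
    Returns the two link paths; pass the first one to the tool.
    """
    bam_link = os.path.join(workdir, stem + ".bam")
    index_link = os.path.join(workdir, stem + index_suffix)
    # Link to absolute targets so the links stay valid from any cwd.
    os.symlink(os.path.abspath(bam_path), bam_link)
    os.symlink(os.path.abspath(index_path), index_link)
    return bam_link, index_link
```

The original .dat files are left untouched, so Galaxy's own naming is not disturbed; only the tool sees the friendlier names.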

Finally, but not recommended: you could write your own index to the path at $bamf.metadata.bam_index to replace the autogenerated bai if you really want to, but be warned that this may not work well if the index was generated with a different version of the reference containing additional contigs not present in the Galaxy reference data.

modified 4.2 years ago • written 4.2 years ago by fubar ♦ 1.1k

Although my situation is similar to that of BAM and BAI files, Galaxy doesn't natively handle the type of files I'm trying to work with. Essentially, I start from a reference genome in FASTA format, and the tool I'm trying to adapt (Yaha, a split-read aligner) generates its own genome index. It does so by producing two files: a compressed binary reference genome (NIB2 format) and an index in its own binary format. On the command line, you point Yaha to the binary index file, among other input files, and it automatically looks for the NIB2 file based on the filename (i.e. it strips the index file's extension and appends ".nib2"). For the sake of simplicity, I would prefer not to add both files to Galaxy's history, prompt the user to select both as input for Yaha, and then create symlinks behind the scenes to make it work. I was hoping that composite datatypes would solve my issue. What do you think?
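The filename rule described above can be written down in a few lines of Python. This is only a sketch of the rule as stated in this post (strip the index extension, append ".nib2"), not Yaha's actual code; the function names and the ".idx" extension in the example are placeholders.

```python
import os


def expected_nib2_path(index_path):
    """Path where the companion NIB2 file is expected, per the rule above."""
    root, _index_ext = os.path.splitext(index_path)
    return root + ".nib2"


def has_companion_nib2(index_path):
    """True if the NIB2 file sits next to the index under the expected name."""
    return os.path.isfile(expected_nib2_path(index_path))


# ".idx" is a placeholder extension, not Yaha's real index suffix.
print(expected_nib2_path("/data/genomes/hg19.idx"))
```

A check like has_companion_nib2 makes it obvious why Galaxy's dataset_NN.dat naming breaks the lookup: the two files no longer share a stem.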

Thanks again :)

modified 4.2 years ago • written 4.2 years ago by Marija • 20

OK - not sure why you didn't say so before now. This is not so hard, but it involves coming to terms with the Galaxy infrastructure for transparently (from the user's point of view) managing sets of reference fasta files (genomes) and binary index files for mappers. A new mapper means a new set of indexes for each genome, maintained manually until you write a new data manager (e.g. see https://wiki.galaxyproject.org/Admin/Tools/DataManagers) to maintain them automatically.

The complexities include .loc files, data managers and the whole related infrastructure for passing the indexes that go with a specific built-in genome. The best way to start would be to take a careful look at the relevant tool xml, loc files and index directories to see how (for example) different sets of binary indices are maintained for the same reference fasta but made available to Bowtie2 and srma via tool-data/bowtie2_indices.loc and srma_index.loc.
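For orientation, a .loc file is essentially a tab-separated lookup table with '#' comment lines. The Python sketch below reads one and looks up the index path for a genome build. The four-column layout (value, dbkey, name, path) is an assumption modelled on typical mapper .loc files, and read_loc and index_path_for are hypothetical helper names; check the .loc.sample file for the tool you are copying to confirm the real columns.

```python
def read_loc(path, columns=("value", "dbkey", "name", "path")):
    """Parse a tab-separated .loc file into a list of dicts.

    Blank lines, '#' comments and rows with too few fields are skipped.
    The column names are an assumed layout, not a Galaxy guarantee.
    """
    entries = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) < len(columns):
                continue
            entries.append(dict(zip(columns, fields)))
    return entries


def index_path_for(entries, dbkey):
    """Return the index base path for a genome build, or None if absent."""
    for entry in entries:
        if entry["dbkey"] == dbkey:
            return entry["path"]
    return None
```

A new mapper such as Yaha would get its own .loc file in the same style, with one row per genome pointing at the base path of that genome's index files.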

Documentation is not perfect, but start here: https://wiki.galaxyproject.org/Admin/DataIntegration

Good luck.

written 4.2 years ago by fubar ♦ 1.1k
