Greetings Galaxy,
I have been attempting to parallelize some tools that take BAM files as input. The most efficient way to implement this (i.e. avoiding unnecessary reads and writes) would be to pass a unique interval of the corresponding BAM to the command itself (e.g. samtools view $input $interval).
I have the following tool which takes an interval file and a BAM file as input:
<tool id="bam_parallelism_test" name="BAM Parallelism Test" version="1.0.0">
  <description> BAM Parallelism Test </description>
  <parallelism method="multi" split_inputs="infile" split_mode="to_size" split_size="1" merge_outputs="outfile"></parallelism>
  <command>
    ln -s $normal input.bam;
    ln -s $normal.metadata.bam_index input.bam.bai;
    samtools view input.bam \$(cat $infile) | head -n 5 > $outfile;
  </command>
  <inputs>
    <param type="data" format="bam" name="normal" label="Normal Alignment BAM"/>
    <param type="data" format="txt" name="infile" label="Interval File"/>
  </inputs>
  <outputs>
    <data format="txt" name="outfile"/>
  </outputs>
</tool>
The infile looks as follows:
1:1-50000000
1:50000001-100000000
1:100000001-150000000
1:150000001-200000000
1:200000001-249250621
2:1-50000000
2:50000001-100000000
2:100000001-150000000
2:150000001-200000000
2:200000001-243199373
The parallelism is not working: the output contains only the first 5 lines of chromosome 1 (the first interval in the txt infile).
I am running this on an AWS CloudMan instance of Galaxy, with a single master node and 5 worker nodes.
Please let me know if you see anything inherently wrong with how I am doing this.
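To make the intended behaviour concrete, here is a rough Python sketch of what I was hoping each split task would end up running: one samtools command per interval in the infile. This is only an illustration, not Galaxy's actual splitting code; the function name expected_commands and its parameters are made up for this example.

```python
import shlex


def expected_commands(interval_text, bam="input.bam", lines_per_task=5):
    """Build one shell command string per interval (hypothetical helper).

    interval_text is the content of the interval file; intervals may be
    separated by newlines or spaces.
    """
    commands = []
    for interval in interval_text.split():
        # One samtools invocation per interval, mirroring the tool's <command>.
        commands.append(
            f"samtools view {shlex.quote(bam)} {shlex.quote(interval)}"
            f" | head -n {lines_per_task}"
        )
    return commands


if __name__ == "__main__":
    sample = "1:1-50000000\n1:50000001-100000000\n2:1-50000000\n"
    for cmd in expected_commands(sample):
        print(cmd)
```

With split_size="1" I assumed each job would receive a single interval like this, and that the per-job outputs would then be merged into outfile.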
Thanks,