Parallelizing BAM calculations using Intervals specified in Text File

3.7 years ago by

Canada

Greetings Galaxy,

I have been attempting to parallelize some tools which use BAM files as input. The most efficient way to implement this (i.e. avoiding unnecessary reads and writes) would be to pass unique intervals of the corresponding BAM to the command itself (i.e. samtools view $input $interval).

I have the following tool which takes an interval file and a BAM file as input:

<tool id="bam_parallelism_test" name="BAM Parallelism Test" version="1.0.0">
    <description>
        BAM Parallelism Test
    </description>
    <parallelism method="multi" split_inputs="infile" split_mode="to_size" split_size="1" merge_outputs="outfile"></parallelism>
    <command>
        ln -s $normal input.bam;
        ln -s $normal.metadata.bam_index input.bam.bai;
        samtools view input.bam \$(cat $infile) | head -n 5 > $outfile;
    </command>
    <inputs>
        <param type="data" format="bam" name="normal" label="Normal Alignment BAM"/>
        <param type="data" format="txt" name="infile" label="Interval File"/>
    </inputs>
    <outputs>
        <data format="txt" name="outfile"/>
    </outputs>
</tool>

The infile looks as follows:

1:1-50000000
1:50000001-100000000
1:100000001-150000000
1:150000001-200000000
1:200000001-249250621
2:1-50000000
2:50000001-100000000
2:100000001-150000000
2:150000001-200000000
2:200000001-243199373

The parallelism is not working.

I am running this on an AWS CloudMan instance of Galaxy with a single master node and 5 worker nodes.

Please let me know if you see anything inherently wrong with how I am doing this. The tool only prints the first 5 lines of chromosome 1 (the first interval in the interval file), instead of 5 lines for each interval.

Thanks,



modified 3.7 years ago • written 3.7 years ago by marcoalbuquerque.sfu • 50

METHOD:

'basic' = single input and single output

'multi' = more inputs or outputs than basic, and ALL need to be accounted for in the parallelism tag:

- split_inputs (those inputs you want to split)
- shared_inputs (those inputs you do not want to be split, but need to be shared among all subprocesses) <- MY ERROR
- merge_outputs (any files that need to be merged; I'll assume this needs to be an output)
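
For anyone landing here later, this is roughly how the fix looks when applied to the tool from the question. It is only a sketch and has not been re-run here: the single change is the added shared_inputs="normal" attribute, which declares the BAM as shared among all subprocesses so that every input is accounted for in the parallelism tag. Everything else is copied unchanged from the original tool.

```xml
<tool id="bam_parallelism_test" name="BAM Parallelism Test" version="1.0.0">
    <description>
        BAM Parallelism Test
    </description>
    <!-- infile is split one line (one interval) per task;
         normal (the BAM) is NOT split, only shared with every task;
         outfile is merged back together from all tasks. -->
    <parallelism method="multi"
                 split_inputs="infile"
                 shared_inputs="normal"
                 split_mode="to_size"
                 split_size="1"
                 merge_outputs="outfile"></parallelism>
    <command>
        ln -s $normal input.bam;
        ln -s $normal.metadata.bam_index input.bam.bai;
        samtools view input.bam \$(cat $infile) | head -n 5 > $outfile;
    </command>
    <inputs>
        <param type="data" format="bam" name="normal" label="Normal Alignment BAM"/>
        <param type="data" format="txt" name="infile" label="Interval File"/>
    </inputs>
    <outputs>
        <data format="txt" name="outfile"/>
    </outputs>
</tool>
```

If the split behaves as intended, each task receives a single interval, so the merged outfile should contain up to 5 reads per interval rather than only the first 5 reads of chromosome 1. As far as I know, task splitting also has to be switched on in the Galaxy server configuration (the use_tasked_jobs option), which I am assuming is already the case on this instance.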
