Question: Parallelizing BAM calculations using Intervals specified in Text File
3.7 years ago by
Canada
marcoalbuquerque.sfu50 wrote:

Greetings Galaxy,

I have been attempting to parallelize some tools that take BAM files as input. The most efficient way to implement this (i.e. avoiding unnecessary reads and writes) would be to pass a distinct interval of the BAM to each invocation of the command (e.g. samtools view $input $interval).


I have the following tool which takes an interval file and a BAM file as input:

<tool id="bam_parallelism_test" name="BAM Parallelism Test" version="1.0.0">
    <description>
        BAM Parallelism Test
    </description>
    <parallelism method="multi" split_inputs="infile" split_mode="to_size" split_size="1" merge_outputs="outfile"></parallelism>
    <command>
        ln -s $normal input.bam;
        ln -s $normal.metadata.bam_index input.bam.bai;
        samtools view input.bam \$(cat $infile) | head -n 5 > $outfile;
    </command>
    <inputs>
        <param type="data" format="bam" name="normal" label="Normal Alignment BAM"/>
        <param type="data" format="txt" name="infile" label="Interval File"/>
    </inputs>
    <outputs>
        <data format="txt" name="outfile"/>
    </outputs>
</tool>

 


The infile looks as follows:

1:1-50000000
1:50000001-100000000
1:100000001-150000000
1:150000001-200000000
1:200000001-249250621
2:1-50000000
2:50000001-100000000
2:100000001-150000000
2:150000001-200000000
2:200000001-243199373
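
For anyone who wants to reproduce an interval file like the one above, here is a minimal sketch in Python. The function name make_intervals and the 50,000,000 bp window are assumptions chosen for illustration; the two chromosome lengths are simply read off the last interval of each chromosome in the listing.

```python
from typing import Dict, List


def make_intervals(chrom_lengths: Dict[str, int], window: int) -> List[str]:
    """Tile each chromosome with 1-based, inclusive windows of at most `window` bp.

    Hypothetical helper, not part of Galaxy or samtools.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    intervals = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window):
            # The last window of a chromosome is truncated at its length.
            end = min(start + window - 1, length)
            intervals.append(f"{chrom}:{start}-{end}")
    return intervals


if __name__ == "__main__":
    # Lengths taken from the final interval of each chromosome in the listing.
    lengths = {"1": 249250621, "2": 243199373}
    print("\n".join(make_intervals(lengths, 50_000_000)))
```

Redirecting the printed output to a text file gives one region string per line, which is the shape that a one-line-per-task split expects.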

The parallelism is not working: the output contains only the first 5 lines of chromosome 1 (the first interval in the interval file).

I am running this on an AWS CloudMan instance of Galaxy with a single master node and 5 worker nodes.

Please let me know if you see anything inherently wrong with how I am doing this.

Thanks,

3.7 years ago by
Canada
marcoalbuquerque.sfu50 wrote:

Parallelism (to my knowledge) works as follows:

 <parallelism method="multi" split_inputs="infile" shared_inputs="normal" split_mode="to_size" split_size="1" merge_outputs="outfile"></parallelism>

METHOD:
    'basic' = single input and single output
    'multi' = more inputs or outputs than 'basic'; ALL of them need to be
              accounted for in the parallelism tag:
        - split_inputs: the inputs you want to split
        - shared_inputs: the inputs you do not want to split but that every
          subprocess needs (MY ERROR: I had left out shared_inputs="normal")
        - merge_outputs: the files that need to be merged (I assume these must be outputs)

In addition, the following lines needed to be uncommented or added in the universe_wsgi.ini* files:

use_tasked_jobs = True
local_task_queue_workers = 2
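
To make the split and merge behaviour concrete, here is a small Python simulation. It is only a sketch of the idea, not Galaxy's implementation: split_to_size, run_tasks and first_reads are hypothetical names, the "BAM" is a plain dict of made-up read lines, and I am assuming that split_mode="to_size" with split_size="1" yields one input line per task and that text outputs are merged by concatenation.

```python
from typing import Callable, Dict, List

Reads = Dict[str, List[str]]


def split_to_size(lines: List[str], split_size: int) -> List[List[str]]:
    """Cut the split input into consecutive chunks of at most split_size lines."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def first_reads(intervals: List[str], reads: Reads, limit: int = 5) -> List[str]:
    """Stand-in for `samtools view input.bam <intervals> | head -n 5`."""
    out: List[str] = []
    for interval in intervals:
        out.extend(reads.get(interval, []))
    return out[:limit]


def run_tasks(chunks: List[List[str]], shared: Reads,
              task: Callable[[List[str], Reads], List[str]]) -> List[str]:
    """Run the task once per chunk with the same shared input, then concatenate."""
    merged: List[str] = []
    for chunk in chunks:
        merged.extend(task(chunk, shared))
    return merged


if __name__ == "__main__":
    intervals = ["1:1-50000000", "1:50000001-100000000", "2:1-50000000"]
    # Made-up data: 8 fake read lines per interval.
    fake_bam = {iv: [f"{iv}\tread{n}" for n in range(8)] for iv in intervals}
    print(len(first_reads(intervals, fake_bam)))   # one task over all intervals
    print(len(run_tasks(split_to_size(intervals, 1), fake_bam, first_reads)))
```

Run as a single task, the stand-in command returns only 5 lines, all from the first interval, which matches the symptom in the question. Run once per interval and merged, it returns 5 lines for every interval.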

 
