Question: How to access data collection metadata (input names of elements)
fiebig wrote, 2.2 years ago:

Hi,

I need help with a tool I'm working on. It takes a data collection (a list of BAM files) as input and is part of a workflow that starts with a list of FASTQ files. The .sh script contains a pipeline that enables multi-threading for samtools mpileup and bcftools. It takes the number of cores, the reference FASTA for samtools, and all BAMs from the collection as input. The initial loop is required for linking each .bam to its .bai index. Thanks to Björn Grüning for that snippet.

            #for $bam in $input_bam:
                 ln -s $bam '${bam}.bam'  &&  ln -s $bam.metadata.bam_index '${bam}.bai' &&
            #end for

            gbs_pileup_bcf_parallel.sh $threads $ref "

            #for $bam in $input_bam:
                    ${bam}
            #end for

            " >${vcf_out} 2>$log

It works quite well, except for one problem: by handing the input ($bam) to bash, I lose track of the sample names. Instead, the path in the working directory is printed (see the entries in {}).

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | {/path/to/dataset_2416.dat} | {/path/to/dataset_2418.dat}
chr1H_part2 | 1470552 | . | G | A | 75.975 | . | DP=3;[...] | GT:PL:DP:DV:GQ | 1/1:108,9,0:3:3:11 | 1/1:0,0,0:0:0:4

Running the normal Tool Shed samtools mpileup in single-threaded mode, I get the following, more convenient output (containing the sample names from the collection):

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | {ETC1_R2.mini.trim.fq} | {ETC1_R1.mini.trim.fq}
chr1H_part2 | 1470552 | . | G | A | 75.975 | . | DP=3;[...] | GT:PL:DP:DV:GQ | 1/1:108,9,0:3:3:11 | 1/1:0,0,0:0:0:4

I need to find a way to replace "/path/to/dataset_2416.dat" with the original sample name "ETC1_R2.mini.trim.fq" from the collection. Since I already have to symlink the .bam and .bai files, I could easily do this at that stage, but I don't know the syntax for accessing the metadata of a data collection, more specifically the input name of each sample in the collection. After linking, I would expect a name like "/path/to/ETC1_R2.mini.trim.fq". That would be fine for me.

Any hints? Or better ideas? ;)

Anne


OK, problem fixed by using the read group information. Nevertheless, I would like to know how to access the metadata of individual elements in a collection. I just realised that I can't even change the names of individual elements via the interface; only the name of the collection itself can be changed. (I just want to read the names of the collection items, not change them, of course.)

Can be closed.

(Reply written 2.2 years ago by fiebig)
y.hoogstrate (Netherlands) wrote, 2.2 years ago:

I think you can access those with ${bam.name}


Using {name} returned the name of the job in the history. The actual symlink wasn't created, maybe because the name was too long. Still, I wonder how samtools passes the input name through to the VCF output.

ln -s /data/filer/galaxy_DB/files/002/dataset_2517.dat 'Map with BWA-MEM on data 39 (mapped reads in BAM format).bam' && ln -s /data/filer/galaxy_DB/files/_metadata_files/000/metadata_276.dat 'Map with BWA-MEM on data 39 (mapped reads in BAM format).bai'

(Reply written 2.2 years ago by fiebig)
bio.erikson wrote, 4 months ago:

If you want to access the name of each individual dataset in the collection (here $input_bam is the collection parameter from the question): ${' '.join([str(x.name) for x in $input_bam])}
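
The part inside ${...} is essentially a Python expression, so the join can be tried outside Galaxy first. Below is a minimal sketch, assuming a hypothetical stand-in class FakeElement (not Galaxy's real wrapper) and a hypothetical helper join_attr; the example values are illustrative only. Note that a bare generator expression passed alongside another argument, as in map(str, x.name for x in ...), is a syntax error in Python, which is why the sketch uses a single generator inside join.

```python
class FakeElement:
    """Hypothetical stand-in for one collection element (not Galaxy's real wrapper)."""

    def __init__(self, name, element_identifier):
        self.name = name
        self.element_identifier = element_identifier


def join_attr(collection, attr="name", sep=" "):
    """Join one attribute of every element into a single string."""
    return sep.join(str(getattr(x, attr)) for x in collection)


# Illustrative values only; the pairing of history names and sample labels is made up.
input_bam = [
    FakeElement("Map with BWA-MEM on data 39 (mapped reads in BAM format)", "ETC1_R1.mini.trim.fq"),
    FakeElement("Map with BWA-MEM on data 40 (mapped reads in BAM format)", "ETC1_R2.mini.trim.fq"),
]

history_names = join_attr(input_bam, "name", sep="; ")
sample_labels = join_attr(input_bam, "element_identifier")
print(history_names)
print(sample_labels)
```

As far as I know, Galaxy exposes the per-element label of a collection as element_identifier, while name is the dataset's name in the history (which matches what was seen above with "Map with BWA-MEM on data 39 ..."). If that holds for your Galaxy version, using element_identifier instead of name in the template should give the sample names.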
