Question: Galaxy sourcing reference data from location not in data table
2.6 years ago by
Belgium
matthias.desmet150 wrote:

Hi all,

I'm having an issue with the Picard tools downloaded from the toolshed. When I try to use the CollectInsertSizeMetrics tool in the Picard suite, I'm asked to provide a reference genome as input. This is no problem, since the tool should read the available references from the all_fasta.loc file.

However, I noticed that the tool apparently also sources its references from some other location. This results in two entries for the hg19 genome, which translate into a comma-separated list of input paths passed as the argument to the tool. The tool then enters an error state, as this isn't a legal argument.

Does anyone have any idea where Galaxy reads its reference files from, apart from the loc files in tool-data? It seems to look for data in the folder defined in the data table, but it also seems to include the ~/galaxy-dist/tool-data/hg19/seq/hg19.fa path for some reason.

This also occurs in all other Picard-related tools that require a reference FASTA input.

Thanks! M

Extra info:

~/galaxy-dist/tool-data/all_fasta.loc:

mm10    mm10    Mouse (Mus Musculus): mm10      /Shared/references/mm10/seq/mm10.fa
danRer7 danRer7 Zebrafish (Danio rerio): danRer7        /Shared/references/danRer7/seq/danRer7.fa
hg19    hg19    Human (Homo sapiens) (b37): hg19        /Shared/references/hg19/seq/hg19.fa
hg_g1k_v37      hg_g1k_v37      Human (Homo sapiens) (b37): hg_g1k_v37  /Shared/references/hg_g1k_v37/seq/hg_g1k_v37.fa
hg38    hg38    Human (Homo sapiens) (b38): hg38        /Shared/references/hg38/seq/hg38.fa
equCab2 equCab2 Horse (Equus caballus): equCab2 /Shared/references/equCab2/seq/equCab2.fa

Excerpt from the offending tool XML:

<command>
    @java_options@
    ##set up input files

    #set $reference_fasta_filename = "localref.fa"

    #if str( $reference_source.reference_source_selector ) == "history":
        ln -s "${reference_source.ref_file}" "${reference_fasta_filename}" &amp;&amp;
    #else:
        #set $reference_fasta_filename = str( $reference_source.ref_file.fields.path )
    #end if

    java -jar \$JAVA_JAR_PATH/picard.jar
    CollectInsertSizeMetrics
    INPUT="${inputFile}"
    OUTPUT="${outFile}"
    HISTOGRAM_FILE="${histFile}"
    DEVIATIONS="${deviations}"

    #if str( $hist_width ):
      HISTOGRAM_WIDTH="${hist_width}"
    #end if

    MINIMUM_PCT="${min_pct}"
    REFERENCE_SEQUENCE="${reference_fasta_filename}"
    ASSUME_SORTED="${assume_sorted}"
    METRIC_ACCUMULATION_LEVEL="${metric_accumulation_level}"

    VALIDATION_STRINGENCY="${validation_stringency}"
    QUIET=true
    VERBOSITY=ERROR

  </command>
  <inputs>
    <param format="sam,bam" name="inputFile" type="data" label="Select SAM/BAM dataset or dataset collection" help="If empty, upload or import a SAM/BAM dataset."/>
    <conditional name="reference_source">
      <param name="reference_source_selector" type="select" label="Load reference genome from">
        <option value="cached">Local cache</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
        <param name="ref_file" type="select" label="Using reference genome" help="REFERENCE_SEQUENCE">
          <options from_data_table="all_fasta">
          </options>
          <validator type="no_options" message="A built-in reference genome is not available for the build associated with the selected input file"/>
        </param>
      </when>
      <when value="history">
        <param name="ref_file" type="data" format="fasta" label="Use the folloing dataset as the reference sequence" help="REFERENCE_SEQUENCE; You can upload a FASTA sequence to the history and use it as reference" />
      </when>
    </conditional>
2.6 years ago by
Belgium
matthias.desmet150 wrote:

Found it,

Apparently Galaxy also sources loc files from the installation directory of any data managers you might have (or have had) installed.

In my case, I used to have data_manager_fetch_genome_all_fasta and used it to download the hg19 genome. Then we switched to manual editing of the loc files and downloading from the rsync server, but the loc file remained, resulting in a double reference with the same dbkey.
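
For anyone hitting the same thing, here is a minimal sketch of how one might hunt for such leftovers. It assumes the loc files are tab-separated with the columns shown in the question (value, dbkey, name, path) and that every copy is named all_fasta.loc; the function names `parse_loc` and `find_duplicate_values` are made up for illustration and are not part of Galaxy. It only walks the directories you hand it and reports values defined more than once.

```python
import os
from collections import defaultdict


def parse_loc(path):
    """Yield the tab-separated columns of each data line in a .loc file."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            yield line.split("\t")


def find_duplicate_values(roots, loc_name="all_fasta.loc"):
    """Return {value: [(loc file, fasta path), ...]} for values defined more than once.

    Assumption: column 1 is the value and column 4 is the FASTA path,
    as in the all_fasta.loc shown in the question.
    """
    seen = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            if loc_name not in files:
                continue
            loc_path = os.path.join(dirpath, loc_name)
            for fields in parse_loc(loc_path):
                if len(fields) >= 4:
                    seen[fields[0]].append((loc_path, fields[3]))
    return {value: hits for value, hits in seen.items() if len(hits) > 1}
```

Calling `find_duplicate_values` with a list containing your tool-data directory plus wherever your toolshed-installed data managers keep their files (that location depends on your setup) should list hg19 twice in a situation like mine, each time with the loc file it came from.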
