Question: GATK 2.8 Realigner Target Creator reference genome issue
3.3 years ago by
France
christophe.habib340 wrote:

Hello everyone,

I want to use Realigner Target Creator, provided in the Tool Shed, with GATK version 2.8. Unfortunately, I get a strange error:

"Failed runtime validation of Using reference genome (A built-in reference genome is not available for the build associated with the selected input file)"

The strange thing is that I am using hg19 as the reference, and when I fill in the form I can actually choose hg19, so I assume it exists.

When I look into the details, I see this first line in the table:

  • Input parameter: Choose the source for the reference list
  • Value: Choose the source for the reference list
  • Note for rerun: None

And lastly, if I try to rerun the failed job, I get a different error:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/DATAS/tmp/galaxy
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR   reads contigs = [chr1, chr2, etc...]
##### ERROR ------------------------------------------------------------------------------------------

 

Does that mean that the picked-up option is wrong?

Thank you for your help.

Christ

ADD COMMENTlink modified 21 months ago by hafiz.talhamalik0 • written 3.3 years ago by christophe.habib340
3.3 years ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

If you are working at http://usegalaxy.org, then the only reference genome that will show up in the "Using reference genome" list is hg_g1k_v37. But that would also mean that GATK version 1.4 is in use, so is this work being done elsewhere?

If this is another public Galaxy instance, contacting the administrators is the way to resolve the problem. Given the error message, the version of hg19 loaded does not appear to be "GATK" sorted.

If this is your own instance, then installing the genome, with correct sorting, will most likely resolve the problem.
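For readers hitting the same "incompatible contigs" message, here is a minimal Python sketch of the kind of check behind it: it compares the contig order in a BAM header with the order in the reference index. It assumes the header has already been exported as text (for example with `samtools view -H`) and that the reference has a `.fai` index. The function names and the toy data are illustrative, not part of Galaxy or GATK.

```python
def contigs_from_sam_header(header_text):
    """Return contig names from the @SQ lines of a SAM/BAM header, in file order."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def contigs_from_fai(fai_text):
    """Return contig names (first tab-separated column of a .fai index), in file order."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def same_relative_order(reads_contigs, reference_contigs):
    """True if the contigs present in both lists appear in the same relative order."""
    shared = set(reads_contigs) & set(reference_contigs)
    reads_shared = [c for c in reads_contigs if c in shared]
    reference_shared = [c for c in reference_contigs if c in shared]
    return reads_shared == reference_shared


# Toy data with made-up lengths and offsets, only to show a mismatch.
bam_header = "@HD\tVN:1.4\n@SQ\tSN:chrM\tLN:100\n@SQ\tSN:chr1\tLN:500\n@SQ\tSN:chr2\tLN:400\n"
reference_fai = "chr1\t500\t6\t50\t51\nchr2\t400\t523\t50\t51\nchrM\t100\t938\t50\t51\n"

reads = contigs_from_sam_header(bam_header)
reference = contigs_from_fai(reference_fai)
print(reads, reference, same_relative_order(reads, reference))
```

If the check fails, the reads were most likely mapped against a FASTA whose sequences are in a different order from the one given to GATK.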

Thanks, Jen, Galaxy team

ADD COMMENTlink written 3.3 years ago by Jennifer Hillman Jackson25k
3.3 years ago by
France
christophe.habib340 wrote:

Hi Jen,

Thank you for your answer.

What do you mean by correct sorting? I used the "Generate GATK-sorted Picard indexes" data manager to prepare the reference genome, and I specified the location in the gatk2_picard_index.loc file.

Did I miss something?

Christ

ADD COMMENTlink written 3.3 years ago by christophe.habib340
3.2 years ago by
France
christophe.habib340 wrote:

Hi Jen,

I still can't use this tool, either in a workflow or on its own. I forgot to say that I am working on my own instance.

I have these lines, written by me or by the data manager:

hg19    hg19    hg19    /home/galaxy/galaxypath/tool-data/hg19/gatk_picard_index/hg19/hg19.fa
hg38    hg38    hg38    /home/galaxy/galaxypath/tool-data/hg19/gatk_picard_index/hg38/hg38.fa

In these files:

./tool-data/toolshed.g2.bx.psu.edu/repos/iuc/gatk2/84584664264c/gatk2_picard_index.loc
./tool-data/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_gatk_picard_index_builder/700f2df51eb0/gatk_sorted_picard_index.loc

And I tried to add them to this file as well:

./tool-data/gatk2_picard_index.loc

Can you tell me which files need to be filled in for my reference genome to be considered installed?

Thank you for your help

Christ

ADD COMMENTlink written 3.2 years ago by christophe.habib340

Hello, the reference genome needs to be added using the "Create DBKey and Reference Genome" Data Manager as the first step. Make sure that the last option on that Data Manager's form, the type of sort, is set to "GATK". Confirm that this is successful, then run the other indexes. If you already have hg19 loaded, be sure to give this version a slightly different "dbkey" or name.

Also, avoid manually editing configuration files when using Data Managers. It is best either to do all of the reference genome work from the command line (without Data Managers) or to use only Data Managers.
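As a read-only alternative to editing the loc files, the following Python sketch only inspects them. It assumes the four-column layout of the lines quoted above; the column labels `value`, `dbkey`, `name`, and `path` are names chosen here for illustration. It also assumes the companion files are named `<fasta>.fai` and `<basename>.dict`. Both function names are hypothetical helpers, not Galaxy code.

```python
import os


def read_loc_entries(loc_text):
    """Parse .loc content into dicts, skipping blank lines and '#' comments."""
    entries = []
    for line in loc_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Prefer tabs; fall back to any whitespace (assumes no spaces inside a column).
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) != 4:
            continue  # not the four-column layout assumed here
        value, dbkey, name, path = (f.strip() for f in fields)
        entries.append({"value": value, "dbkey": dbkey, "name": name, "path": path})
    return entries


def missing_companion_files(fasta_path):
    """Return the expected files (FASTA, .fai, .dict) that do not exist on disk."""
    base, _ext = os.path.splitext(fasta_path)
    expected = [fasta_path, fasta_path + ".fai", base + ".dict"]
    return [p for p in expected if not os.path.exists(p)]


loc_text = (
    "hg19\thg19\thg19\t/home/galaxy/galaxypath/tool-data/hg19/gatk_picard_index/hg19/hg19.fa\n"
)
for entry in read_loc_entries(loc_text):
    print(entry["dbkey"], missing_companion_files(entry["path"]))
```

An empty list for a dbkey means the FASTA and both index files are where the loc entry says they are. It says nothing about whether they are sorted consistently.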

Hopefully this helps, Jen, Galaxy team

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k
3.2 years ago by
France
christophe.habib340 wrote:

Hi Jen,

Thank you for your suggestion. I followed it and created a new GATKhg19 genome using "Create DBKey and Reference Genome" with the GATK sort option. Then I used the BWA index builder and the "Generate GATK-sorted Picard indexes" builder.

My workflow still stops at Realigner Target Creator with the very same errors as in my first message. I should add that we have this in the log file:

INFO  11:34:10,574 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:34:10,576 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15 
INFO  11:34:10,576 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  11:34:10,576 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  11:34:10,580 HelpFormatter - Program Args: -T RealignerTargetCreator -o /home/galaxy/galaxy_21juil2015/galaxy/database/files/002/dataset_2608.dat --num_cpu_threads_per_data_thread 1 --num_threads 1 -R /home/galaxy/galaxy_21juil2015/galaxy/tool-data/GATKhg19/gatk_picard_index/GATKhg19/GATKhg19.fa -I /DATAS/tmp/galaxy/tmp-gatk-SqZohx/gatk_input.bam 
INFO  11:34:10,580 HelpFormatter - Date/Time: 2015/09/01 11:34:10 
INFO  11:34:10,580 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:34:10,580 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:34:11,294 GenomeAnalysisEngine - Strictness is SILENT 
INFO  11:34:11,447 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  11:34:11,461 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  11:34:11,495 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
INFO  11:34:11,979 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext) 
INFO  11:34:11,979 HttpMethodDirector - Retrying request 
INFO  11:34:11,980 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext) 
INFO  11:34:11,980 HttpMethodDirector - Retrying request 
INFO  11:34:11,982 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext) 
INFO  11:34:11,982 HttpMethodDirector - Retrying request 
INFO  11:34:11,984 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext) 
INFO  11:34:11,984 HttpMethodDirector - Retrying request 
INFO  11:34:11,985 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext) 
INFO  11:34:11,985 HttpMethodDirector - Retrying request 

I don't know how to fix this, and it is really annoying for me. Do you have any suggestions?

Thank you,

Christ

ADD COMMENTlink written 3.2 years ago by christophe.habib340
3.2 years ago by
France
christophe.habib340 wrote:

Is it possible that the builder for GATK is not properly preparing the genome for GATK2?

ADD COMMENTlink written 3.2 years ago by christophe.habib340

The format of the required indexes should be the same for both versions of GATK. The issue is most likely with which version of the initial reference genome was provided to the tool. If there are still issues, please reply to the first comment from this morning and explain what is going on in more detail. 

Manipulating the reference genome indexes manually could create problems, but you should not need to do so if the tools are used in this order: fetch the genome (GATK sort) while creating a new dbkey, then run all indexes using that same dbkey.

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k
3.2 years ago by
France
christophe.habib340 wrote:

I finally found what happened with my jobs. When I looked at the FASTA files used by these tools, it turned out that chrM was not sorted in the same way.

The issue was with the data manager for GATK. Apparently it didn't use the genome.fa produced with the GATK sorting option. I took the right file, GATK sorted, and generated the .dict and .fai files from it. I placed them in the path written in the loc file. Now it works fine.
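After regenerating indexes by hand like this, it can be worth confirming that the .dict and .fai files describe the same sequences in the same order. The sketch below is a minimal check. It relies on the sequence dictionary being SAM-header-style text whose `@SQ` lines carry `SN` (name) and `LN` (length) tags, and on the .fai being tab-separated with name and length in its first two columns. The function names and toy contents are illustrative only.

```python
def dict_contigs(dict_text):
    """Return (name, length) pairs from the @SQ lines of a sequence dictionary."""
    contigs = []
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Split each TAG:VALUE field once, so values containing ':' stay intact.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


def fai_contigs(fai_text):
    """Return (name, length) pairs from the first two columns of a .fai index."""
    contigs = []
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs.append((name, int(length)))
    return contigs


def indexes_agree(dict_text, fai_text):
    """True if both files list the same contigs, with the same lengths, in the same order."""
    return dict_contigs(dict_text) == fai_contigs(fai_text)


# Toy contents with made-up lengths and paths.
toy_dict = (
    "@HD\tVN:1.4\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:500\tUR:file:/tmp/toy.fa\n"
    "@SQ\tSN:chrM\tLN:100\tUR:file:/tmp/toy.fa\n"
)
toy_fai = "chr1\t500\t6\t50\t51\nchrM\t100\t523\t50\t51\n"
print(indexes_agree(toy_dict, toy_fai))
```

A mismatch here would suggest that the two index files were built from different FASTA files.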

I don't really understand why the BWA data manager used the right reference genome file while the GATK data manager used the wrong one...

ADD COMMENTlink written 3.2 years ago by christophe.habib340

If I understand correctly, there are two versions of the genome on your instance. But these have distinct dbkeys. And all downstream Data Managers have been run or re-run using the GATK-sorted reference genome as input. Also, each was executed just once and none used the original non-GATK sorted dbkey as input. 

If this is true, then it is unclear how or in which index chrM is not sorted the same way. All would be based off of the same root dbkey that is already GATK-sorted.

Thanks for providing more details. If the exact steps to reproduce can be shared, we can try to troubleshoot. 

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k

Indeed, I had "hg19" installed with the sorting option "as is", and I initially used it to create the indexes for BWA and for GATK Picard.

Then I followed your guidance and created the dbkey "GATKhg19" with the sorting option "GATK", and I ran the BWA data manager and then the GATK data manager with GATKhg19 only.

When I used the GATK tools with GATKhg19, I got this error telling me that the sorting was not the same. Indeed, the reference genome defined in the GATK loc file was not the same as the one in BWA and in "all_fasta", if you consider the order of the sequences in it.

I wonder whether the GATK data manager is suited for GATK2?

If you need more information, please let me know.

ADD REPLYlink written 3.2 years ago by christophe.habib340

I would like to delete my reference genomes cleanly so that I can define them again properly. What would be the right way to do so?

I thought I would delete all the files specified in the loc files, then erase the lines in these loc files and start again. But I'm afraid that the database might keep track of the previous "hg19".

Any suggestions?

Thanks

ADD REPLYlink written 3.2 years ago by christophe.habib340

Deleting genomes & indexes can be very complicated. There is no tool to do it automatically at present that I am aware of, although it is on the to-do list. 

How new is your instance? It might be worth it to start completely over. 

The other option is to simply use a distinct dbkey and load up the proper genomes, run indexes on them, and ignore the ones with problems. Avoid recycling dbkeys if you do this - the conflicts will appear again.

Running these tools for the first time can be a bit tricky, as the order and content are so important to get right. But once you have correct genome fasta files with distinct dbkeys set up with the fetch genome Data Manager, all should go smoothly when running the indexes.

I hope this helps you to make a decision about how to proceed. Jen, Galaxy team

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k
3.2 years ago by
France
christophe.habib340 wrote:

Well I think I should not delete anything right now since I'm not sure that I found the real issue.

I tried to launch my workflow with the GATKhg19 genome, once again the realigner target creator failed. But when I try to "run this job again" it works. I didn't change anything.

I'm lost... Could this problem be related to the workflow and not to the tools? Do I have to clean the cache and delete the content of the tmp folder to avoid conflicts?

ADD COMMENTlink written 3.2 years ago by christophe.habib340
3.2 years ago by
France
christophe.habib340 wrote:

Hi Jen,

I am coming back with this very same problem. I created a new instance with the very latest version of Galaxy, and I used all the data managers properly. I mean, I used the "GATK sorting" option in the first step, and each following step referred to the previous one.

When I use GATK (here 3.4, but it does not matter), I still get the error:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: The contig order in reads and referenceis not the same;
##### ERROR   reads contigs = [chrM, chr1, chr2
##### ERROR   reference contigs = [chr1, chr2, chr3,

I think I found the explanation by looking the files located in tool-data/hg19, which are the files generated by the data_managers.

galaxy@rosetta:~/galaxy/tool-data/hg19$ ls -lrt */*/*
-rw-r--r-- 1 galaxy galaxy       3534 sept. 30 20:22 sam_indexes/hg19/hg19.fa.fai
lrwxrwxrwx 1 galaxy galaxy         17 sept. 30 20:22 sam_indexes/hg19/hg19.fa -> ../../seq/hg19.fa
-rw-r--r-- 1 galaxy galaxy       4035 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa.ann
-rw-r--r-- 1 galaxy galaxy  784290318 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa.pac
-rw-r--r-- 1 galaxy galaxy 1568580688 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa.sa
-rw-r--r-- 1 galaxy galaxy       8595 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa.amb
-rw-r--r-- 1 galaxy galaxy 3137161344 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa.bwt
lrwxrwxrwx 1 galaxy galaxy         17 sept. 30 21:22 bwa_mem_index/hg19/hg19.fa -> ../../seq/hg19.fa
-rw-r--r-- 1 galaxy galaxy       3539 oct.   2 11:52 gatk_picard_index/hg19/hg19.fa.fai
-rw-r--r-- 1 galaxy galaxy 3199905909 oct.   2 11:52 gatk_picard_index/hg19/hg19.fa
-rw-r--r-- 1 galaxy galaxy      14735 oct.   2 11:52 gatk_picard_index/hg19/hg19.dict

As you can see, both the sam_indexes and bwa_mem_index FASTA files are symbolic links to seq/hg19.fa. This FASTA file is NOT GATK sorted, whereas gatk_picard_index/hg19.fa is. So I use BWA for the mapping and then try to use GATK, but it can't work since their reference files are different.
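The same observation can be made programmatically. This is a minimal Python sketch, assuming a directory tree like the listing above in which each index subdirectory holds a FASTA with the same file name. It resolves symbolic links so that it is obvious which index directories share one underlying file. `resolve_reference_fastas` is a hypothetical helper name.

```python
import os


def resolve_reference_fastas(index_root, fasta_name):
    """Map each copy or symlink of fasta_name under index_root to its real path."""
    resolved = {}
    for dirpath, _dirnames, filenames in os.walk(index_root):
        if fasta_name in filenames:
            path = os.path.join(dirpath, fasta_name)
            # realpath follows symbolic links to the file actually being read.
            resolved[os.path.relpath(path, index_root)] = os.path.realpath(path)
    return resolved


def group_by_target(resolved):
    """Group index-relative paths by the real file they point to."""
    groups = {}
    for relative_path, real_path in resolved.items():
        groups.setdefault(real_path, []).append(relative_path)
    return {target: sorted(paths) for target, paths in groups.items()}
```

For example, `group_by_target(resolve_reference_fastas("tool-data/hg19", "hg19.fa"))` would return one group per distinct underlying FASTA. More than one group means the indexes were not all built from the same file, which is worth following up by comparing the sequence order of each.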

I haven't tried yet, but it would mean that I need to generate the .fai, .dict, etc. from gatk_picard_index/hg19.fa and replace them in all these directories so that the reference is the same everywhere. Am I right?

Is there a simple way to correct this data manager's behaviour?

Thank you,

Christ

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by christophe.habib340

From my point of view, the origin of this problem would be that the data manager "Create DBKey and Reference Genome - fetching" does NOT sort the FASTA file when we select GATK in the form.

I hadn't noticed before, but this option is now called "Choose the source for the reference genome"... Does that mean that it does not sort anything and that we need to provide the GATK-sorted FASTA ourselves?

ADD REPLYlink written 3.2 years ago by christophe.habib340

Hi - our team is looking into this more. We want these data managers to work without any extra steps. When installed in a more manual way, all data is based off of the GATK sorted reference genome then all indexes are derived from that. An unsorted genome is not included anywhere. This is not quite ready for a Trello ticket since it may just be a usage issue, but I do not know the solution. Instead I asked our team and bookmarked this. Ping back if no reply in a week. 

Sorry for the problems but let's work it out. Will help you and others doing the same set-up. Jen

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k
3.2 years ago by
France
christophe.habib340 wrote:

Hi, Thank you for your answer.

Now I have two options:

  • Generate all the .dict, .fai, .len, etc. files and replace all of them in my Galaxy folders. I think this is going to be quite long and complicated.

  • Delete all my reference genome files, erase them from the loc files, and generate them again with the data managers from a GATK-sorted hg19 FASTA file instead of the hg19 coming from UCSC. Do you think this could lead to conflicts with the database?

ADD COMMENTlink written 3.2 years ago by christophe.habib340

Making any changes to these reference data manually is tricky. This is something even I avoid (and our team when at all possible). 

As another work-around option, maybe download the GATK sorted hg19 fasta file that you have, start up a brand new Galaxy, load that fasta file into a history, then start with the data managers from that point forward? The "genome fetching" data manager has an option for picking a genome from a history - it doesn't have to be retrieved from one of the external sources.

More later, our team member that designed the data manager functionality will hopefully be able to help determine the root problem here and offer usage advice or other. This may not be until next week.

ADD REPLYlink written 3.2 years ago by Jennifer Hillman Jackson25k

Thank you for your help.

Do you plan to work on an option to DELETE references with data managers when they were installed with them?

ADD REPLYlink written 3.2 years ago by christophe.habib340

Truthfully, I do not know, but I will ask our team to consider the question and reply. Meanwhile, you can also open a Trello ticket with the enhancement request to get it into the system. If it is a duplicate, the tickets can just be merged. Thanks for all of the thoughtful feedback! Jen

ADD REPLYlink written 3.1 years ago by Jennifer Hillman Jackson25k
3.2 years ago by
France
christophe.habib340 wrote:

Hi Jen,

I followed your guidance, here is the result.

I used the "history" option as the source in the "Create DBKey and Reference Genome - fetching" data manager with a GATK-sorted FASTA file and the "GATK" option. But when I check tool-data/hg19/seq/hg19.fa, the sorting is wrong (with chrM as the first sequence), whereas tool-data/hg19/gatk_picard_index/hg19/hg19.fa is sorted properly.

Then I tried the same thing with a good GATK-sorted hg38.fa file, but I did NOT pick the "GATK" option. I picked the "as is" option. It works fine, and the sorting is right.

So my guess was right: when we pick the GATK option, the data manager applies a wrong sort, which leads to conflicts between references.
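For anyone who wants to repeat this comparison, here is a minimal Python sketch that lists the sequence names of a FASTA in file order and reports the first position where two orders diverge. It takes the sequence name to be the text after ">" up to the first whitespace. Both function names are illustrative.

```python
from itertools import zip_longest


def fasta_contig_order(lines):
    """Yield sequence names from FASTA header lines, in file order."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def first_difference(order_a, order_b):
    """Return (index, name_a, name_b) at the first mismatch, or None if identical."""
    for index, (a, b) in enumerate(zip_longest(order_a, order_b)):
        if a != b:
            return index, a, b
    return None


# Toy FASTA content; real files would be passed as open file handles.
seq_fasta = [">chrM toy\n", "ACGT\n", ">chr1\n", "ACGTACGT\n"]
gatk_fasta = [">chr1\n", "ACGTACGT\n", ">chrM toy\n", "ACGT\n"]

print(first_difference(list(fasta_contig_order(seq_fasta)),
                       list(fasta_contig_order(gatk_fasta))))
```

With real files, an open handle can be passed directly, for example `list(fasta_contig_order(open(path)))`, since only the header lines are kept in memory.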

I hope this will help to solve this behaviour.

Regards,

Christ

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by christophe.habib340

Thanks Christ for testing this out more. Our team will review the Data Manager sorting method and make changes as needed. Jen

ADD REPLYlink written 3.1 years ago by Jennifer Hillman Jackson25k
21 months ago by
hafiz.talhamalik0 wrote:

ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: IndelRealigner

I keep getting this same error again and again. I have tried updating Java, and I also have the latest GATK, 3.7. Can anyone help me?

ADD COMMENTlink written 21 months ago by hafiz.talhamalik0

Hello,

GATK tools wrapped for Galaxy are known to have issues, and any of them could fail for a variety of reasons. This applies both to those hosted as deprecated at Galaxy Main (http://usegalaxy.org) and to those available from the Tool Shed for use in a local/cloud Galaxy (including the data manager).

There are no updates or corrections planned that I am aware of. Instead, we recommend using alternative variant analysis tools. For examples, please see the tutorials here: https://new.galaxyproject.org/learn/

Thanks! Jen, Galaxy team

ADD REPLYlink modified 20 months ago • written 20 months ago by Jennifer Hillman Jackson25k