Hello, everyone: I am building a workflow for SNV calling on whole-genome sequences. I was following the GATK best practices, but I am new to the field and have run into a problem. I get an error when I use an input file created in Galaxy after steps such as mapping to the reference, duplicate marking, and indel realignment. The file is a BAM, and the reads are sorted by coordinate. Reading the error description left me clueless because there is too much information in the debugger window. Could anyone guide me through troubleshooting and fixing the issue? Thank you in advance. PS. I am reading the materials listed in the debugger, but it is still not clear to me what to do.
Hello,
As described in the error and in the GATK website resources, chromosome order is important. This is true whether you work in Galaxy or on the command line. My guess is that some of your inputs follow the ordering guidelines but others do not. All must be sorted the same way: this includes the reference genome used in alignment and other steps, reference annotation data, and aligned data (BAMs).
FAQs:
- https://biostar.usegalaxy.org/p/14777/
- https://galaxyproject.org/support/#getting-inputs-right- > https://galaxyproject.org/support/sort-your-inputs/
Small warning: the Galaxy-wrapped GATK tools are at least one version old (in the Tool Shed) and even older in the deprecated set hosted at https://usegalaxy.org. These are not recommended. Changes in licensing may mean that a future decision to update the tools is possible, but nothing is certain. For now, see the alternative variant analysis tools in the tutorials.
Galaxy tutorials:
Thanks! Jen, Galaxy team
Thank you for the prompt reply. I see what you are talking about. Please correct me if I am wrong: I should sort in chromosome order not only the files I am analyzing but all files used in the tool, including the reference files in FASTA and BED formats. Did I get it right?
Yes, all should be the same. Consistent sort order in all inputs is required. You may need to recreate your custom genome, then start over from mapping.
This situation is somewhat common when going through these tools for the first time. The prepped data from GATK is already formatted this way but doesn't cover all genomes.
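If it helps to see where the orders disagree before rerunning anything, here is a rough Python sketch of the comparison. It is not part of GATK or Galaxy: the function names are made up for illustration, and it assumes you have the BAM header as text (for example from `samtools view -H`), the reference index (`.fai`) as text, and the BED file as text. The toy contig names and lengths are invented.

```python
def contigs_from_sam_header(header_text):
    """Contig names in the order of the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def contigs_from_fai(fai_text):
    """Contig names in the order of a FASTA index (.fai); column 1 is the name."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def contigs_from_bed(bed_text):
    """Contig names in order of first appearance in a BED file.

    Note: this does not detect a contig that reappears later in the file.
    """
    seen = []
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def follows_reference_order(reference, other):
    """True if every contig in `other` is in `reference` and in the same relative order."""
    rank = {name: i for i, name in enumerate(reference)}
    if any(name not in rank for name in other):
        return False
    ranks = [rank[name] for name in other]
    return ranks == sorted(ranks)


# Toy inputs (invented values, for illustration only).
fai = "chrM\t100\t6\t50\t51\nchr1\t500\t120\t50\t51\nchr2\t400\t640\t50\t51\n"
header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:500\n"
    "@SQ\tSN:chr2\tLN:400\n"
    "@SQ\tSN:chrM\tLN:100\n"
)
bed = "chr1\t10\t20\nchr2\t5\t50\n"

reference_order = contigs_from_fai(fai)
bam_ok = follows_reference_order(reference_order, contigs_from_sam_header(header))
bed_ok = follows_reference_order(reference_order, contigs_from_bed(bed))
print("BAM matches reference order:", bam_ok)
print("BED matches reference order:", bed_ok)
```

In this toy case the BAM header lists chrM last while the reference lists it first, so the BAM check fails and the BED check passes. A mismatch like that in the BAM header usually means the data was mapped against a differently ordered genome, which is why remapping may be needed.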
In addition, thank you for the tip on alternative tools. I am going to try them.