Question: How to Realign Indels and logic of workflow
3.1 years ago by
United States
marcosp wrote:

I have paired-end reads from a trio that I eventually need to call variants on (make a .vcf). So I'm experiencing the same thing. Please tell me if my logic is correct:

FASTQ -> FastQC -> BWA-MEM (read group ID assigned as part of this step) -> Sort with SAMtools -> Filter -> Clean -> Mark Duplicates -> Realign Indels (I think I have to import hg19 or b37 from Shared Data -> Data Libraries -> GATK -> hg19 (import to current history)).

Though I currently have the reference list set to History, this message still comes up: "History does not include a dataset of the required format / build." There is also a section, "Binding for reference-ordered data", which I'm not sure what to do with; I'm assuming I should set it to indels.

Also, when moving from one step to the next, should the input always be the BWA-MEM BAM files, or the output of the preceding step (i.e. the sorted BAM into filtering, rather than the BWA-MEM BAM)? Please let me know if you need any more information to answer my questions. This is a project due soon...

modified 3.1 years ago by Jennifer Hillman Jackson • written 3.1 years ago by marcosp
3.1 years ago by
United States
Jennifer Hillman Jackson wrote:


If you use hg_g1k_v37 as the reference genome, instead of hg19, then you can use the built-in indexes with GATK. Whichever you choose, be sure to use the exact same reference genome for all steps. This includes data that is based on a particular build, for example VCF reference annotation (aka "reference-ordered data").
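A quick way to spot a mixed-reference problem is to compare the sequence names in a BAM/SAM header with the contig names in a VCF header. The following is only a minimal sketch in Python, not something Galaxy or GATK provides; the function names and the example header text are hypothetical. It relies on the fact that UCSC hg19 names chromosomes "chr1", "chr2", ... while b37-style builds such as hg_g1k_v37 use "1", "2", ...

```python
import re


def sam_contigs(header_text):
    """Return the sequence names found on the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def vcf_contigs(header_text):
    """Return the contig IDs found on the ##contig lines of a VCF header."""
    return re.findall(r"^##contig=<ID=([^,>]+)", header_text, flags=re.MULTILINE)


def naming_style(names):
    """Classify names as 'chr-prefixed', 'unprefixed', 'mixed', or 'unknown'."""
    if not names:
        return "unknown"
    prefixed = [name.startswith("chr") for name in names]
    if all(prefixed):
        return "chr-prefixed"
    if not any(prefixed):
        return "unprefixed"
    return "mixed"


def same_naming_style(sam_header, vcf_header):
    """True only if both headers use the same, recognizable naming style."""
    sam_style = naming_style(sam_contigs(sam_header))
    vcf_style = naming_style(vcf_contigs(vcf_header))
    return sam_style == vcf_style and sam_style in ("chr-prefixed", "unprefixed")


# Hypothetical example: reads mapped to hg19, annotation built on b37.
sam_header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:249250621\n"
vcf_header = "##fileformat=VCFv4.1\n##contig=<ID=1,length=249250621>\n"
print(same_naming_style(sam_header, vcf_header))
```

Matching names are necessary but not sufficient: two builds can share a naming style and still differ in sequence, so this check only catches the most common hg19 versus b37 mix-up.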

hg19 can also be used, but be aware that running some GATK tools against a custom reference genome of this size can exhaust the available memory. If this occurs, move to a cloud Galaxy or a local Galaxy with sufficient resources.

Other than that, the workflow looks fine. The output of each step is the input to the next. When using an indexed reference genome (hg_g1k_v37), the input dataset must have the same genome assigned as its "database" attribute. When using a genome from the history, its datatype (format) needs to be assigned as "fasta".
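To make the chaining explicit, here is a small Python sketch that lays out which dataset feeds which step. It is purely illustrative: the step labels and file names are hypothetical and do not correspond to actual Galaxy tool IDs or dataset names.

```python
# Step order taken from the workflow described in the question (labels are hypothetical).
STEPS = ["bwa_mem", "sort", "filter", "clean", "mark_duplicates", "realign_indels"]


def chain_steps(sample, steps=STEPS):
    """Return (step, input, output) tuples where each input is the previous output."""
    current = f"{sample}.fastq"
    chained = []
    for step in steps:
        output = f"{sample}.{step}.bam"
        chained.append((step, current, output))
        current = output  # the next step consumes this output, not the original BAM
    return chained


for step, source, result in chain_steps("child"):
    print(f"{step}: {source} -> {result}")
```

In this layout the filtering step reads the sorted BAM, the cleaning step reads the filtered BAM, and so on; the raw BWA-MEM output is used only once, as input to sorting.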

Best, Jen, Galaxy team


written 3.1 years ago by Jennifer Hillman Jackson

Thank you! Another question: I have heard of people merging their data. Is this specific to multiple reads on one sample, or should it be done across all samples (i.e. a trio being analyzed for variants)? In a Galaxy tutorial, the person merged all of the samples after mapping them, but I don't know if that needs to happen right away. I am currently running the later steps on a per-sample basis, processing multiple files in a step simultaneously, but they are still separate files for each sample. I'm not sure if merging is something I should do now, or if it can wait until a later step.

written 3.1 years ago by marcosp

The important part is to merge the data before variant calling. Exactly when this happens among the earlier steps usually doesn't matter. With that in mind, smaller files are easier to process through many tools because of memory requirements. Just be sure to assign read groups before combining. This can be done right after mapping or with the mapping tool itself (if the input is in fastq format), so that each read group is tied to specific BAM/SAM data and the sample it represents.
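As a rough illustration of what "assign read groups before combining" means, the Python sketch below builds an @RG line in the form that the command-line `bwa mem -R` option accepts (fields separated by a literal backslash-t) and checks a trio's read groups before merging. The helper names, IDs, and sample names are hypothetical; in Galaxy the same fields are normally entered in the mapping tool's form instead.

```python
def read_group_line(rg_id, sample, platform="ILLUMINA", library=None):
    """Build an @RG line using literal backslash-t separators."""
    fields = [f"ID:{rg_id}", f"SM:{sample}", f"PL:{platform}"]
    if library:
        fields.append(f"LB:{library}")
    return "@RG\\t" + "\\t".join(fields)


def check_read_groups(read_groups):
    """Raise ValueError if any read group lacks SM or reuses an ID."""
    seen = set()
    for group in read_groups:
        if not group.get("SM"):
            raise ValueError(f"read group {group.get('ID')!r} has no sample (SM)")
        if group["ID"] in seen:
            raise ValueError(f"duplicate read group ID {group['ID']!r}")
        seen.add(group["ID"])


# Hypothetical trio: one read group per sample, each with a unique ID.
trio = [
    {"ID": "father_L1", "SM": "father"},
    {"ID": "mother_L1", "SM": "mother"},
    {"ID": "child_L1", "SM": "child"},
]
check_read_groups(trio)  # passes silently when the groups are safe to merge
print(read_group_line("child_L1", "child"))
```

The SM field is what lets a variant caller tell the samples apart after the BAM files have been merged, which is why it has to be in place first.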

If you have more new questions, please send these in as new posts. Thanks! Jen

modified 3.1 years ago • written 3.1 years ago by Jennifer Hillman Jackson