Question: Problem With Depth Of Coverage On Bam Files (Gatk Tools)
6.5 years ago by
Lilach F190
Lilach F190 wrote:
Hi,

I am trying to use Depth of Coverage to see the coverage in specific intervals. The intervals were taken from UCSC (exons of 2 genes), loaded into Galaxy, and the file type was changed to intervals. I gave Depth of Coverage two BAM files (produced by BWA, then keeping only the rows matching the pattern XT:A:U, then SAM-to-BAM) and the intervals file (in the advanced GATK options). The reference genome is hg_g1k_v37.

I got the following error message:

An error occurred running this job:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/space/g2main
##### ERROR
##### ERROR A USER ERROR has occurred (version 1.4-18-g80a4ce0):
##### ERROR The invalid argume

Is it a bug, or did I do anything wrong? I will be grateful for any help.

Thanks!
Lilach
bwa alignment • 3.3k views
modified 6.5 years ago by Jennifer Hillman Jackson25k • written 6.5 years ago by Lilach F190
6.5 years ago by
United States
Jennifer Hillman Jackson25k wrote:
Hi Lilach,

The problem with this analysis probably comes from a mismatch between the genomes: the intervals obtained from UCSC (hg19) and the BAM from your BWA run (hg_g1k_v37). UCSC does not contain the genome 'hg_g1k_v37'; the genome available from UCSC is 'hg19'. Even though these are technically the same human release, on a practical level they have a different arrangement for some of the chromosomes. You can compare NCBI GRCh37 <http://www.ncbi.nlm.nih.gov/genome/assembly/2758/> with UCSC hg19 <http://genome.ucsc.edu/> for an explanation.

Reference genomes must be exact, base for base, in order to be used with tools. When they are exact, the identifier will be identical between Galaxy and the source (UCSC, Ensembl), or the full build name will provide enough information to make a connection to NCBI or another source.

Sometimes genomes are similar enough that a dataset sourced from one can be used with another, if the database attribute is changed and the data from the regions that differ is removed. This may be possible in your case; only trying will tell you how difficult it actually is with your analysis.

The GATK pipeline is very sensitive to exact inputs. You will need to be careful with genome database assignments, etc. Following the links on the tool forms to the GATK help pages can provide more detail about expected inputs, if this is something that you are going to try.

Good luck with the re-run!

Jen
Galaxy team
--
Jennifer Jackson
http://galaxyproject.org
written 6.5 years ago by Jennifer Hillman Jackson25k
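One way to spot the kind of mismatch Jen describes before running GATK is to compare the contig names used in the intervals file with the names in the BAM header. The following is only a minimal sketch: it assumes the header text has already been extracted (for example with a tool such as samtools), that the intervals are BED-like with the contig in the first column, and the function names and the two-contig header are made up for illustration.

```python
def contigs_from_sam_header(header_lines):
    """Return contig names from the @SQ lines of a SAM/BAM header, in file order."""
    names = []
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def unknown_interval_contigs(interval_lines, reference_contigs):
    """Return contig names used by the intervals but absent from the reference."""
    known = set(reference_contigs)
    missing = []
    for line in interval_lines:
        # Skip blank lines and BED comment/track/browser lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig = line.split()[0]
        if contig not in known and contig not in missing:
            missing.append(contig)
    return missing


# Illustrative header with b37-style names (no "chr" prefix).
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:13\tLN:115169878",
]
# One UCSC-style interval, as in the question.
intervals = ["chr13\t32890597\t32890664\tNM_000059_cds_1_0_chr13_32890598_f\t0\t+"]

print(unknown_interval_contigs(intervals, contigs_from_sam_header(header)))
# → ['chr13']
```

A non-empty result means the intervals name contigs the BAM does not know about. It says nothing about whether the coordinates themselves agree between the two builds.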
I'm curious: what is this genome called 'hg_g1k_v37', and how does it correspond to NCBI GRCh37, which is identical to UCSC hg19?

--Hiram
written 6.5 years ago by Hiram Clawson260
If hg_g1k_v37 == "1000 Genomes version of GRCh37", then it is the GRCh37 Primary assembly + a decoy sequence to try to soak up off-target reads. The chromosome coordinates are the same, but the sequences included in the packages are different. Here is the description from the 1000 Genomes site:

http://www.1000genomes.org/category/assembly

Deanna
written 6.4 years ago by Church, Deanna (NIH/NLM/NCBI) [E]30
Hi Jennifer,

Thank you for this reply. I made a new BWA file, this time using the hg19(full) genome. However, when I try to use DepthOfCoverage, the reference genome is stuck on hg_g1k_v37 (this is the only option to select), and I cannot change it to hg19(full). Most probably this is because I selected hg_g1k_v37 the previous time I tried to use DepthOfCoverage. Is this a bug? How can I change it?

Thanks,
Lilach
written 6.4 years ago by Lilach F190
Hi Lilach,

I have been dealing with these issues for some time now. The only genome you can use with Picard and GATK tools in Galaxy is hg_g1k_v37. I think this is why:

"If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT. The order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence."
[1] http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK

So far, what I have done when presented with a BAM file produced with a reference that has lexicographical chromosome ordering is to use Picard's ReorderSam tool, also in Galaxy, selecting hg_g1k_v37 as the reference. You might not be able to do this since, if I recall correctly, hg19 also uses chr1, chr2, ... instead of 1, 2, ... In that case more work needs to be done, and at that point it is almost easier to just remap with the correct reference for use with GATK. In your case it seems you already have that. What you might need to do is re-sort your intervals file and probably change the chromosome identifiers. This, I think, can be done inside Galaxy.

I would love to hear comments about this approach, as sometimes I do worry, like Hiram's comment hints, that hg19 and hg_g1k_v37 might not be completely identical apart from the chromosome ordering. In that case my re-sorted BAM or intervals files might be incorrect.

Hope it helps,
Carlos
written 6.4 years ago by Carlos Borroto390
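The difference between the karyotypic order quoted above and a lexicographic sort can be illustrated with a short sketch. It assumes intervals are held as (contig, start, end) tuples with b37-style names and covers only the main contigs 1-22, X, Y, MT. The real reference contains further contigs, so the order list and function name here are illustrative, not the GATK's own implementation.

```python
# Karyotypic order of the main b37 contigs, as listed in the GATK quote.
B37_MAIN_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def sort_intervals(rows, contig_order=B37_MAIN_ORDER):
    """Sort (contig, start, end) tuples by contig order, then start, then end.

    Raises KeyError if a row names a contig that is not in contig_order.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    return sorted(rows, key=lambda r: (rank[r[0]], r[1], r[2]))


rows = [("2", 100, 200), ("10", 5, 50), ("1", 300, 400), ("X", 1, 10), ("1", 20, 30)]

# A plain sort compares contig names as strings, so "10" lands before "2".
print(sorted(rows))
# → [('1', 20, 30), ('1', 300, 400), ('10', 5, 50), ('2', 100, 200), ('X', 1, 10)]

# Karyotypic order puts 2 before 10.
print(sort_intervals(rows))
# → [('1', 20, 30), ('1', 300, 400), ('2', 100, 200), ('10', 5, 50), ('X', 1, 10)]
```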
Hi Carlos,

Thank you very much for this explanation. The format of my intervals file is:

chr13  32890597  32890664  NM_000059_cds_1_0_chr13_32890598_f  0  +
chr13  32893213  32893462  NM_000059_cds_2_0_chr13_32893214_f  0  +
chr13  32899212  32899321  NM_000059_cds_3_0_chr13_32899213_f  0  +
chr13  32900237  32900287  NM_000059_cds_4_0_chr13_32900238_f  0  +
etc...

Can you please explain how to change this format so that I can give it as input to DepthOfCoverage?

Thanks,
Lilach
written 6.4 years ago by Lilach F190
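The rows above look like UCSC BED records, which use a 0-based start and an exclusive end. GATK-style interval strings of the form contig:start-stop are, as far as I understand the GATK documentation, 1-based and inclusive. A conversion could be sketched as below. The function name is hypothetical, and whether the Galaxy wrapper accepts the resulting list is an assumption that would need checking.

```python
def bed_to_gatk_intervals(bed_lines):
    """Convert BED lines (0-based, half-open) to 'contig:start-stop' strings
    (1-based, inclusive). Only the first three columns are used."""
    intervals = []
    for line in bed_lines:
        # Skip blank lines and BED comment/track/browser lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        # Shift the start by one; the exclusive BED end equals the inclusive stop.
        intervals.append(f"{contig}:{int(start) + 1}-{int(end)}")
    return intervals


bed = [
    "chr13\t32890597\t32890664\tNM_000059_cds_1_0_chr13_32890598_f\t0\t+",
    "chr13\t32893213\t32893462\tNM_000059_cds_2_0_chr13_32893214_f\t0\t+",
]
print(bed_to_gatk_intervals(bed))
# → ['chr13:32890598-32890664', 'chr13:32893214-32893462']
```

The shifted start positions match the coordinates embedded in the UCSC record names (for example chr13_32890598). This sketch does not rename contigs, so the output still uses UCSC-style names.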
Hello Lilach,

Currently, the human reference genome indexed for the GATK-beta tools is 'hg_g1k_v37'. The GATK-beta tools are under active revision by our team, so we expect there to be little to no change to the beta version on the main public instance until this is completed.

Attempting to convert data between different builds is not recommended. These tools are very sensitive to exact inputs, which extends to naming conventions, etc. The best practice path is to start and continue an analysis project with the same exact genome build throughout.

If you want to use the hg19 indexes provided by the GATK project, a cloud instance is the current option (using an hg19 genome as a 'custom genome' will exceed the processing limits available on the public Galaxy instance). Following the links on the GATK tools can provide more information about sources, including links on the GATK web site that note the exact contents of both of these genome versions, downloads, and other resources.

Hopefully this helps to clear up any confusion.

Best,
Jen
Galaxy team
--
Jennifer Jackson
http://galaxyproject.org
written 6.4 years ago by Jennifer Hillman Jackson25k
Hi Lilach,

Sorry for the late response. Jen just confirmed the disadvantages of my approach. I don't know how difficult it would be for you to double check that the coordinates in your interval file are correct for hg_g1k_v37. If you feel confident they will work and want to proceed, you could do something like this outside of Galaxy (I'm sure you could also find a way to do it inside Galaxy):

sed 's/^chr//' interval_file.csv > interval_file_g1k.csv

If you have coordinates for the mitochondrial chromosome, you might have to rename it as well, in the same pass so that a second command does not overwrite the output of the first:

sed -e 's/^chrM/MT/' -e 's/^chr//' interval_file.csv > interval_file_g1k.csv

If I remember correctly, UCSC uses chrM and the b37 reference expects MT. Please double check this as I'm not sure.

It would also be nice if there were confirmation of what exactly hg_g1k_v37 is, and where you could find annotations for it. Would annotations from Ensembl do?

Regards,
Carlos
written 6.4 years ago by Carlos Borroto390
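The renaming Carlos describes can also be sketched in Python, with a guard against contigs that have no obvious b37 counterpart. This is a minimal illustration rather than a validated converter. It assumes tab-separated BED-like rows, assumes UCSC hg19 names the mitochondrion chrM while b37 names it MT, and uses made-up function names. Renaming does not check that the coordinates agree between the two builds. The mitochondrial sequences in particular are, to my knowledge, not the same in hg19 and b37, so intervals there should not simply be renamed.

```python
MAIN_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y"}


def ucsc_to_b37_name(contig):
    """Map a UCSC-style contig name (chr13) to a b37-style name (13).

    Raises ValueError for names outside 1-22, X, Y, MT (for example
    unplaced or haplotype contigs), which cannot be renamed by prefix alone.
    """
    if contig in ("chrM", "MT"):
        return "MT"
    name = contig[3:] if contig.startswith("chr") else contig
    if name not in MAIN_CONTIGS:
        raise ValueError(f"no simple b37 name for contig {contig!r}")
    return name


def rename_bed_line(line):
    """Rewrite the first column of a tab-separated BED line; keep the rest."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = ucsc_to_b37_name(fields[0])
    return "\t".join(fields)


line = "chr13\t32890597\t32890664\tNM_000059_cds_1_0_chr13_32890598_f\t0\t+"
print(rename_bed_line(line))
```

Only the first column changes. The record name in column four still contains "chr13", which is harmless because it is just a label.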
Hi Jennifer,

Is there a way to upload my files directly from the public Galaxy to my cloud Galaxy instance (in AWS)? Or should I download them to my computer first, and then upload them? (That takes a lot of time because of the low upload speed.)

Thanks,
Lilach
written 6.4 years ago by Lilach F190
Hi Lilach,

Regarding the cloud instance, you can load data from the public main instance of Galaxy just like any other URL. On the "Get Data -> Upload Data" form on your cloud instance, paste in the URLs of the datasets from main. The URL can be captured by right-clicking on a dataset's disk icon and then choosing "Copy link location" (on a Mac; do the equivalent if using a PC).

It is generally better to transfer one URL per job if the data is large, since jobs have a limited amount of time to complete. If you lump several large file URLs together into one job, there is a chance that it could time out. It is fine to execute several jobs concurrently.

Best,
Jen
Galaxy team
--
Jennifer Jackson
http://galaxyproject.org
written 6.4 years ago by Jennifer Hillman Jackson25k
May I second Carlos's question? What exactly is hg_g1k_v37, and how can I get the intervals of specific genes in this format?

Thanks,
Lilach
written 6.4 years ago by Lilach F190
Hello Lilach,

The genome build 'hg_g1k_v37' is build "b37" in the GATK documentation. Hg19 is also included (as a distinct build). I encourage you to examine these if you are interested in crossing over between genomes or identifying other projects that have data based on the same genome build.

http://www.broadinstitute.org/gsa/wiki/index.php/Introduction_to_the_GATK
-> http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

"GATK resource bundle: A collection of standard files for working with human resequencing data with the GATK. The standard reference sequence we use in the GATK is the b37 edition from the Human Genome Reference Consortium. All of the key GATK data files are available against this reference sequence. Additionally, we used to use UCSC-style (chr1, not 1) for build hg18, and provide lifted-over files from b37 to hg18 for those still using those files.

b37 resources: the standard data set
* Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
<more, please follow link for details ...>

hg19 resources: lifted over from b37
* Includes the UCSC-style hg19 reference along with all lifted over VCF files."

Hopefully this helps,

Jen
Galaxy team
--
Jennifer Jackson
http://galaxyproject.org
written 6.4 years ago by Jennifer Hillman Jackson25k