Problem with bam files for exome sequencing analysis

2.4 years ago by

nadja.chmel • 10 wrote:

Hello everyone,

I have run into a problem at the very start of an exome sequencing analysis, probably related to the data format. I have a patient with an unknown mutation, and I want to compare the patient's data with the parents' to find candidate genes. A company performed the whole exome sequencing and sent us the BAM files. I tried to follow the described plan for exome sequencing and wanted to run FreeBayes, but it did not work, and neither did the other tools. Every time, the following error appeared: "Sequences are not currently available for the specified build." I then went one step back and converted the BAM files into SAM files, which worked. However, I cannot convert these SAM files back into BAM files; the same error appears. Does anyone have an idea what is wrong with these BAM files and how I can solve the problem? I had to upload the files via FTP because of their size; could that be the cause? I am very inexperienced in this field and grateful for every hint.

Thanks a lot and best regards, Nadja

modified 2.4 years ago • written 2.4 years ago by nadja.chmel • 10

Hi Nadja

I presume this data is on usegalaxy.org? If it is, could you please run the BAM files through the Flagstat tool and report on the statistics? And which species are these reads from?

written 2.4 years ago by Peter van Heusden • 150

Thank you very much. Yes, the data is on usegalaxy.org. I will try this Flagstat tool now. These reads are from human.

written 2.4 years ago by nadja.chmel • 10

This is the result for my first BAM file with the flagstat tool.

57702389 + 0 in total (QC-passed reads + QC-failed reads)
402873 + 0 secondary
0 + 0 supplementary
8685951 + 0 duplicates
57298398 + 0 mapped (99.30%:-nan%)
57299516 + 0 paired in sequencing
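As a rough aid to reading these numbers, the report can be parsed with a few lines of Python. This is only a sketch: `parse_flagstat` and `REPORT` are hypothetical names introduced here, and the regular expression assumes the `N + M label` line layout seen in the output above.

```python
import re

# Assumed layout of each line: "<passed> + <failed> <label> (optional details)"
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*\))?$")

REPORT = """\
57702389 + 0 in total (QC-passed reads + QC-failed reads)
402873 + 0 secondary
0 + 0 supplementary
8685951 + 0 duplicates
57298398 + 0 mapped (99.30%:-nan%)
57299516 + 0 paired in sequencing
"""


def parse_flagstat(text):
    """Map each flagstat label to a (qc_passed, qc_failed) tuple."""
    stats = {}
    for line in text.strip().splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            passed, failed, label = match.groups()
            stats[label] = (int(passed), int(failed))
    return stats


stats = parse_flagstat(REPORT)
total = stats["in total"][0]
mapped_pct = 100 * stats["mapped"][0] / total
# Records that are not secondary alignments.
primary = total - stats["secondary"][0]

print(f"mapped: {mapped_pct:.2f}%")
print("primary records:", primary)
print("paired in sequencing:", stats["paired in sequencing"][0])
```

In this report the number of non-secondary records equals the "paired in sequencing" count, which suggests that the reads come from a paired-end run.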

I do not have the experience to judge whether these numbers are good or bad, so I will look into that now. Do they give any indication of whether the BAM files should work with the other tools (FreeBayes, etc.)?

written 2.4 years ago by nadja.chmel • 10

The data here are from an intact BAM file and represent well-mapped (likely filtered) results for paired-end sequencing (nearly all reads are reported as "paired in sequencing").

The problem is with the database/build assignment or possibly a sorting issue. See comment below for details.

written 2.4 years ago by Jennifer Hillman Jackson ♦ 25k

What build was used for the BAM files? Depending on the tool, I sometimes run into problems when using hg19, whereas hg18 usually works.

written 2.4 years ago by bryantl • 90

Thank you for your answer. Unfortunately, I do not know which build was used for the BAM files, because the sequencing was done by a company. I will ask them. I tried both hg19 and hg18, but neither worked.

written 2.4 years ago by nadja.chmel • 10

So the right build should be hg19.

written 2.4 years ago by nadja.chmel • 10

Hi Nadja, see my comment below for how to confirm that the BAM dataset really uses hg19 exactly as released by UCSC (otherwise this could be a genome build mismatch problem). Jen, Galaxy team

written 2.4 years ago by Jennifer Hillman Jackson ♦ 25k

2.4 years ago by

bryantl • 90

United States

bryantl • 90 wrote:

From your answers, it looks like no database is set for your BAM file. For any of the tools to process your data, you have to indicate which build was used to create it. You can check whether the build is set by clicking on the file in your Galaxy history. This expands the entry and shows the file size, file type and database. If the database is shown as a question mark, edit the attributes (click on the pencil icon) and enter the build that was used.

written 2.4 years ago by bryantl • 90

2.4 years ago by

nadja.chmel • 10

nadja.chmel • 10 wrote:

I have checked the database set for my BAM file: it is Homo sapiens b37, which matches the information from the company.

written 2.4 years ago by nadja.chmel • 10

Is your BAM file indexed? I guess if you index your BAM file this will be sorted!

written 2.4 years ago by reza.jabal • 0

Yes, it is indexed.

written 2.3 years ago by nadja.chmel • 10

BAM datasets are indexed and sorted when uploaded.

Please see my other reply to this thread for help to resolve the problem from here. A genome mismatch problem is almost certainly part of the issue, and correcting that could resolve it completely.

Thanks! Jen, Galaxy team

written 2.3 years ago by Jennifer Hillman Jackson ♦ 25k

2.4 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Bit more help:

BAM datasets are indexed and sorted when uploaded.

Try assigning the database metadata again (pencil icon) directly to the datasets, if not already, as some tools do require that during execution. If tools fail after that, it could be a sorting issue if other tools were run between upload and the failed runs (how to fix: https://github.com/jennaj/support-known-issues/wiki/Sort-your-inputs) or it may be that the wrong database is assigned (data are not actually from hg19).
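For anyone with command-line access, one quick check related to the sorting question is to look at the sort order declared in the BAM header (the header text can be printed with `samtools view -H`). The sketch below assumes that header text has already been saved as a string; `header_sort_order` and `example_header` are hypothetical names used only for illustration.

```python
def header_sort_order(header_text):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Illustrative header, not taken from the files discussed in this thread.
example_header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n"
print(header_sort_order(example_header))
```

Note that the SO tag only records what the file claims about itself; it does not prove that the records are actually in that order, so re-sorting is still the safer fix when in doubt.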

It is very important to use the same exact genome build for all steps in an analysis. It is also possible to double-check yourself to see if the build assigned is really the correct one to match the data. The wrong build assignment can create an array of failure types or simply poor result data (poor in content, not necessarily a red job failure or an empty result).

I suggest confirming the build, even if the 3rd party states that hg19 was used. The release "homo sapiens b37" could also be a match for the genome build "hg_g1k_b37" - or even others not available in Galaxy but that could be used as a Custom reference genome (Ensembl, etc.)

To assign "hg19", the reference used for the data must match the UCSC distribution exactly (same sequence names and lengths), or there could be problems.

This is how to check for a mismatch: https://wiki.galaxyproject.org/Support#Reference_genomes
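The idea behind such a check can also be sketched in Python: collect the sequence names and lengths from the `@SQ` lines of the BAM header and compare them with those of the build assigned in Galaxy. Everything named here (`reference_sequences`, `compare_to_build`, the sample header) is hypothetical, and the two-entry `hg19_subset` table is an illustrative subset rather than a complete build definition.

```python
def reference_sequences(header_text):
    """Collect {sequence name: length} from the @SQ lines of a SAM header."""
    refs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            refs[fields["SN"]] = int(fields["LN"])
    return refs


def compare_to_build(bam_refs, build_refs):
    """Return (names missing from the build, names whose lengths differ)."""
    missing = sorted(n for n in bam_refs if n not in build_refs)
    length_mismatch = sorted(
        n for n in bam_refs if n in build_refs and bam_refs[n] != build_refs[n]
    )
    return missing, length_mismatch


# Illustrative b37-style header: sequence names carry no "chr" prefix.
b37_style_header = "@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16569\n"
# Illustrative subset of UCSC hg19 names and lengths (assumption: subset only).
hg19_subset = {"chr1": 249250621, "chrM": 16571}

bam_refs = reference_sequences(b37_style_header)
missing, length_mismatch = compare_to_build(bam_refs, hg19_subset)
print("not in assigned build:", missing)
print("length differs:", length_mismatch)
```

If every name from the BAM header shows up as missing, as in this example, the data and the assigned build use different naming schemes, which is the kind of mismatch described above.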

This is how to use a Custom Reference genome (if needed): https://wiki.galaxyproject.org/Support#Custom_reference_genome

Hope this gets resolved! If you get stuck, a bug report can be sent in from one of the red datasets - if working at http://usegalaxy.org or the problem can be reproduced there.

modified 2.4 years ago • written 2.4 years ago by Jennifer Hillman Jackson ♦ 25k
