Question: Trouble with samtools mpileup when I supply my own reference genome
cmarcho (Greece) wrote, 4.4 years ago:

I've been using Galaxy to run the samtools mpileup tool. When I use the built-in reference genome (in my case, mouse mm10), everything runs smoothly and the output bcf works great for my downstream applications.

However, when I try to run the mpileup with a reference genome I provide, I get a bcf that seems to be the appropriate size, but when I convert it to a vcf, I only get the header. I am not sure if I am messing up something with the bcf-->vcf conversion (it works fine with my other samples) or if it is something with me supplying my own reference.

My reference is recognized by Galaxy from my history (it is a .fa). I also added an index file for the reference to my history in case that magically changed something. I have also used my supplied reference for other applications and it worked fine.

If anyone has any suggestions, I would really appreciate them!

Thanks!

Tags: rna-seq, mpileup, samtools
Jennifer Hillman Jackson (United States) wrote, 4.4 years ago:

Hello,

As a test to see whether the run produces any useful output with the given parameters, have you tried running the tool with the alternate output type (mpileup)? That output is plain text, not binary.

If you are using the advanced option "List of regions or sites on which to operate:", make sure that the identifiers in that regions file exactly match those in your reference genome fasta file. A common problem is that the regions file contains a short identifier that is only part of the longer chromosome title ">" line in the reference genome, and a partial match like that is not an exact match.
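
If you can download both files, the identifier check can be done outside Galaxy. Below is a minimal Python sketch, not part of any Galaxy tool. It assumes the regions file is tab-delimited with the sequence name in the first column, and that the fasta identifier is the text between ">" and the first whitespace. The function names and the example data are made up for illustration.

```python
def fasta_ids(lines):
    """Return the identifier of each '>' title line (text up to the first whitespace)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def region_ids(lines):
    """Return the distinct first-column names of a tab-delimited regions file, in order."""
    seen = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name = line.split("\t")[0]
        if name not in seen:
            seen.append(name)
    return seen


def unmatched_regions(fasta_lines, region_lines):
    """List region names that have no exactly matching fasta identifier."""
    known = set(fasta_ids(fasta_lines))
    return [name for name in region_ids(region_lines) if name not in known]


# Hypothetical example data; with real files, pass open(path) instead of a list.
fasta = [">chr1 Mus musculus chromosome 1", "ACGT", ">chr2", "GGCC"]
regions = ["chr1\t100\t200", "1\t300\t400", "chr2\t5\t10"]
print(unmatched_regions(fasta, regions))  # prints ['1']
```

Any name that is printed appears in the regions file but has no exact counterpart among the fasta identifiers.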

That said, I wonder whether the conversion is hitting a reference genome identifier mismatch. For technical reasons, full title lines can sometimes cause problems with custom genomes. Try using the reference genome, in fasta format, with only identifiers (no description content). This means removing everything after the first whitespace in the title ">" lines. (The final identifiers must still match any reference files included, as above.)

You can do this easily in the Galaxy UI, tools are in the group Fasta Manipulation:

1. Convert fasta -> tabular, breaking the title line into 2 fields
2. Convert tabular -> fasta, choosing the 1st and 3rd column (leaving the 2nd, the description, behind)
3. Wrap the fasta lines; 60 is a good choice, but anything from 40 to 80 is okay
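
If you prefer to prepare the file before uploading it, the same cleanup can be approximated locally. The following Python sketch is only an illustration of the idea, not what the Galaxy tools run internally. It keeps the identifier (everything up to the first whitespace in each ">" line), drops the description, and rewraps the sequence at a fixed width. The function name `simplify_fasta` and the example records are hypothetical.

```python
def simplify_fasta(lines, width=60):
    """Strip descriptions from '>' lines and rewrap sequence lines to `width` characters."""
    out = []
    seq = []  # sequence chunks collected for the current record

    def flush():
        # Join the chunks collected so far and emit them in fixed-width rows.
        joined = "".join(seq)
        for i in range(0, len(joined), width):
            out.append(joined[i:i + width])
        seq.clear()

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            flush()
            fields = line[1:].split()
            out.append(">" + (fields[0] if fields else ""))
        elif line.strip():
            seq.append(line.strip())
    flush()
    return out


# Hypothetical example records; with a real file, pass open(path) instead of a list.
example = [
    ">chr1 Mus musculus chromosome 1",
    "ACGTACGT",
    "ACGT",
    ">chr2 some description",
    "GGCC",
]
for row in simplify_fasta(example, width=10):
    print(row)
```

The sketch holds one full sequence in memory at a time, which is simple but not tuned for very large chromosomes.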

More about reference genomes and troubleshooting: 
https://wiki.galaxyproject.org/Support#Reference_genomes

Usually a minor change in format resolves these sorts of issues. Once the format checks out, examine the data content and tool settings to make sure that the data will pass any minimum criteria that are set.

Hopefully this helps, Jen, Galaxy team
