Question: MPileup and BCF
3.5 years ago by
d.angra50
United Kingdom
d.angra50 wrote:

Hello

To discover SNPs among my samples, I used MPileup from SAMtools. However, when I run bcftools view to inspect the output, I see a tabular file with only 26 comment lines. After hitting this problem more than once, I checked the MPileup output and saw that the log file contains some 28,000 lines, all with the message: The sequence "TR1|c0_g5_i1" not found. What is going wrong? The BCF file produced by MPileup is nevertheless 2.6 GB.

Could anyone please help me find a possible source of error?

Any help is appreciated.

Viva

samtools • 1.0k views
modified 3.5 years ago • written 3.5 years ago by d.angra50
3.5 years ago by
d.angra50
United Kingdom
d.angra50 wrote:

Hello Jen

Thank you for your help. I think I have understood why there is a mismatch. I ran a de novo assembly (on Vicia faba data available from SRA) using Trinity on the Indiana instance of Galaxy and obtained a transcriptome assembly of 361,316 contigs, all with identifiers beginning with TR1, TR2, etc. I then uploaded this file to usegalaxy for SNP discovery, for which I have developed a workflow.

The first step of the workflow is to align my dataset against the Trinity assembly (uploaded after the de novo assembly) as a reference, which I do using BWA-MEM. I get output in BAM format, and when I load it in IGV I see the mismatch (as I asked you about earlier). However, when I align the reads to my other reference (the Medicago genome), I again get output in BAM format, and when I visualize this in IGV I can see it clearly. I did FASTA manipulations to match the identifiers, but this made no difference to the visualization in IGV.

This makes me think that when I align my dataset against the Trinity reference, IGV cannot read the result because it treats identifiers such as TR1 as unidentifiable. On the other hand, when I align against the reference genome (Medicago), the BAM gets identifiers like IMGA etc., which IGV recognises. So I think the problem is that Galaxy and IGV do not recognise the identifiers.

I am not sure my reasoning is entirely correct, but I think it is, to a great extent, a good explanation.

Could you please help me with this and suggest other tools with which I can align my datasets using the Trinity assembly as a reference?

Looking forward to your reply.

Viva

written 3.5 years ago by d.angra50

Hello,

Are you able to visualize the result in Trackster within Galaxy when the custom genome (from Trinity) is used? That was not clear. If the data can be visualized in Galaxy, but is still problematic with IGV, contacting their support team for advice is probably the best path.

The identifier format should not be a factor in any bioinformatics tool as long as the custom genome is in fasta format. But format details can matter. Sometimes it helps to remove description lines. Wrapping the genome is also often needed by downstream tools (but often not mapping tools). The troubleshooting link I gave earlier can help to perform these manipulations. If you alter the identifiers, then all steps from mapping forward need to be re-run.
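As a rough illustration of the two manipulations mentioned above (dropping description lines and wrapping the sequence), here is a minimal Python sketch. It is not the Galaxy tool itself: the function name normalize_fasta, the 60-character default width, and the description text in the example header are assumptions made up for the example.

```python
def normalize_fasta(lines, width=60):
    """Return FASTA lines with descriptions dropped and sequence wrapped.

    Hypothetical helper for illustration; `width` is an assumed default.
    """
    records = []  # list of (identifier, [sequence chunks])
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the text up to the first whitespace;
            # everything after it is the optional description.
            fields = line[1:].split()
            records.append((fields[0] if fields else "", []))
        elif records:
            records[-1][1].append(line)

    out = []
    for identifier, parts in records:
        out.append(">" + identifier)
        sequence = "".join(parts)
        # Re-wrap the joined sequence to a fixed line width.
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return out


if __name__ == "__main__":
    # The description after the identifier is invented for this example.
    example = [">TR1|c0_g1_i1 len=12 example_description", "ACGTACGTAC", "GT"]
    print("\n".join(normalize_fasta(example, width=8)))
```

Only the header lines and line breaks change here; the sequence content is untouched. Note that this sketch keeps the identifier itself as-is, so the names seen by the mapping tool stay the same.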

If you are still having problems, you could post a snippet of the fasta file (not just a list of identifiers, but actual lines from the file), like:

>identifier  optional_description
ACGATCGATCGTCATCGTCGATAGCATCGATCGACTACT
CGATAGCTGATCGTCGTACGTATACTCGATCATCGACTG
CATCAGTCAATCGGCCATCGACCATCGA[...]

Then a few lines from the BAM dataset (after converting to SAM).

Make certain that the content of each is related to the same "chromosome" (e.g. assembled transcriptome sequence).
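If you want to check this yourself before posting, the comparison can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are made up for the example, the reference is a plain-text fasta file, and the BAM has already been converted to SAM so that its header contains the usual @SQ lines with an SN: field for each reference sequence.

```python
def fasta_ids(fasta_lines):
    """Collect identifiers (text after '>' up to the first whitespace)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def sam_reference_names(sam_lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in sam_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def compare_identifiers(fasta_lines, sam_lines):
    """Report which names are shared and which appear on one side only."""
    in_fasta = fasta_ids(fasta_lines)
    in_sam = sam_reference_names(sam_lines)
    return {
        "shared": in_fasta & in_sam,
        "only_in_fasta": in_fasta - in_sam,
        "only_in_sam": in_sam - in_fasta,
    }


if __name__ == "__main__":
    # Toy inputs; the lengths are invented for the example.
    fasta = [">TR1|c0_g1_i1", "ACGTA", ">TR1|c0_g2_i1", "GGG"]
    sam = ["@HD\tVN:1.0", "@SQ\tSN:TR1|c0_g1_i1\tLN:5"]
    print(compare_identifiers(fasta, sam))
```

If "shared" comes back empty, the fasta file and the alignments do not refer to the same reference, which is the kind of mismatch to look for here.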

Thanks, Jen, Galaxy team

written 3.5 years ago by Jennifer Hillman Jackson25k

Hello Jen

I appreciate your reply. I understand now.

Thanks

Viva

written 3.5 years ago by d.angra50
3.5 years ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

Check the chromosome identifiers in all of the inputs. Do these identifiers match - exactly?

The pipe ( "|" ) shouldn't be an issue, but each tool is different. If nothing else works, try removing it and run the workflow again. Custom reference genomes work best with simple identifiers and no description lines in the fasta file. See the troubleshooting help here for how to adjust fasta files within Galaxy itself.
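For anyone who prefers to do the renaming outside of Galaxy, a minimal Python sketch of the idea follows. The function name simplify_headers and the choice of "_" as the replacement character are assumptions for the example, not part of any Galaxy tool.

```python
def simplify_headers(lines, replacement="_"):
    """Strip descriptions and replace '|' in FASTA identifiers.

    Hypothetical helper for illustration. Raises ValueError if two
    identifiers become identical after renaming, since duplicate names
    would make the reference ambiguous.
    """
    seen = set()
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            name = fields[0].replace("|", replacement)
            if name in seen:
                raise ValueError(f"identifier collision after renaming: {name}")
            seen.add(name)
            out.append(">" + name)
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out


if __name__ == "__main__":
    # The description after the identifier is invented for this example.
    print(simplify_headers([">TR1|c0_g5_i1 example_description", "ACGT"]))
```

Keep in mind that once identifiers are altered, every step from mapping forward has to be re-run against the renamed reference, otherwise the BAM and the fasta file will disagree again.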

I didn't ask which server you are working on, but that is likely not a factor, based on the given information.

Thanks, Jen, Galaxy team

written 3.5 years ago by Jennifer Hillman Jackson25k
3.5 years ago by
d.angra50
United Kingdom
d.angra50 wrote:

Hi Jen,

Thank you. I have done the fasta manipulation on my reference data and, after a few processing steps, moved on to BWA-MEM alignment of my datasets. This job created a BAM file. However, when I tried to visualize it in IGV (which had my reference sequence loaded), I got this error:

       File: C:\Users\ Downloads\Galaxy30-[BWA-MEM_on_data_10_and_data_29_(mapped_reads_in_BAM_format)].bam
does not contain any sequence names which match the current genome.

File:      TR1|c0_g1_i1, TR1|c0_g2_i1, TR1|c0_g3_i1, TR1|c0_g4_i1, ...
Genome: IMGA|contig_65682_1.1, IMGA|contig_52881_1.1, IMGA|contig_52881_3.1, IMGA|contig_65138_2.1, ...     

 

I do not understand what this means or why I am getting this error. Could you please help me? I also converted the file to BED format and tried to visualise it in IGV. It loaded without error, but I cannot see anything on the screen. I am a bit worried about all this.

I am looking forward to finding a solution to this so that I can proceed with my analyses.

Thanks

Viva

modified 3.5 years ago • written 3.5 years ago by d.angra50

Hello Viva, the sequence identifiers in the BAM dataset do not match those of the genome loaded in IGV (identifiers == names). Make sure that the same reference genome is used for all steps and that the formatting is correct before the very first step that uses it (generally mapping). In particular, review the link I included in the original reply, and this line: "Do these identifiers match - exactly"

Hopefully you will be able to sort this out. Not much can be done with data containing conflicting identifiers versus the reference genome/annotation datasets used with tools (Galaxy or otherwise). Best, Jen, Galaxy team

written 3.5 years ago by Jennifer Hillman Jackson25k