Pileup file using NCBI-SRA toolkit

7 months ago by

ppurkayastha2010 • 30

I have a pileup file for accession SRR1011475, obtained using the NCBI SRA Toolkit:

MT_H37RV_BRD_V5 750 C   1   ^~, #
MT_H37RV_BRD_V5 751 C   1   ,   #
MT_H37RV_BRD_V5 752 G   1   ,   #
MT_H37RV_BRD_V5 753 C   1   ,   #
MT_H37RV_BRD_V5 754 G   1   ,   #

But I cannot figure out which reference sequence was used to build this pileup file.

When I download the BAM file for the same accession (SRR1011475) and generate a pileup from it with my own reference, using [Generate pileup from BAM dataset (Galaxy Version 1.1.2), which uses samtools], I get N as the reference base:

MT_H37RV_BRD_V5 750 N   1   ^~c #   ~
MT_H37RV_BRD_V5 751 N   1   c   #   ~
MT_H37RV_BRD_V5 752 N   1   g   #   ~
MT_H37RV_BRD_V5 753 N   1   c   #   ~
MT_H37RV_BRD_V5 754 N   1   g   #   ~

Why do the two files differ? Please suggest a way to resolve this.

7 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Hello,

Your workflow is missing a step: genome mapping. BAM datasets from this source contain reads only, with no mapping results. Some Galaxy-wrapped tools accept BAM sequence inputs, but most do not. Instead, download the FASTQ data directly with the tool Download and Extract Reads in FASTA/Q format from NCBI SRA, map the reads yourself with BWA-MEM (using a custom genome as needed), then run Mpileup.
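
Outside Galaxy, the same three steps can be sketched on the command line. The Python snippet below only builds and prints the commands; it does not run them. It assumes the SRA Toolkit (`fasterq-dump`), `bwa`, and `samtools` are installed, that the run is single-end so the reads land in `SRR1011475.fastq`, and that `my_reference.fa` is a placeholder for your own genome. The function name `build_pipeline` and the output file names are made up for illustration.

```python
import shlex


def build_pipeline(accession, reference_fasta, threads=4):
    """Return shell commands for download -> map -> pileup (not executed here)."""
    ref = shlex.quote(reference_fasta)
    acc = shlex.quote(accession)
    fastq = f"{acc}.fastq"        # assumption: single-end run
    bam = f"{acc}.sorted.bam"     # hypothetical output name
    pileup = f"{acc}.pileup"      # hypothetical output name
    return [
        # 1. Download the reads as FASTQ from NCBI SRA.
        f"fasterq-dump {acc}",
        # 2. Index the reference, then map and coordinate-sort the alignments.
        f"bwa index {ref}",
        f"bwa mem -t {threads} {ref} {fastq} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        # 3. Generate the pileup against the same reference used for mapping.
        f"samtools mpileup -f {ref} -o {pileup} {bam}",
    ]


for command in build_pipeline("SRR1011475", "my_reference.fa"):
    print(command)
```

For a paired-end run, `fasterq-dump` writes `_1.fastq` and `_2.fastq` files, and both would be passed to `bwa mem`.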

Support FAQs: https://galaxyproject.org/support/#getting-inputs-right

Galaxy tutorials: https://galaxyproject.org/learn/

For the data from NCBI: You could ask NCBI to be certain, but the genome the data was mapped to is probably the one mentioned on the Sample description form.

Be aware that the results will differ unless you use exactly the same genome for mapping and exactly the same tool versions and parameters (whatever tools they used to map and to generate the pileup data). I didn't find the processing description at NCBI for this one with a quick search, but you could review the study and any related publications in more detail, or ask the authors about the workflow and tools used. Knowing the original workflow is the only way to truly replicate any experiment. This is one of many good reasons to do the data processing yourself in Galaxy: you'll have a complete history of which tools and parameters were used, be able to tune parameters to get the best result, and can extract what you did into a workflow for later reuse or to share with others.
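
On the difference between the two pileup snippets in the question: in pileup format, `.` and `,` in the read-bases column mean "same as the reference base" on the forward and reverse strand, while a letter gives the read base explicitly (upper case for forward, lower case for reverse). When no reference base is available, the reference column is N and the letters are printed instead. A minimal sketch of a decoder, assuming whitespace-separated columns as pasted above (the function names are hypothetical), suggests that position 750 carries the same call in both files:

```python
def decode_bases(ref, bases):
    """Turn a pileup read-bases string into a list of (base, strand) calls."""
    calls = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":            # read start marker, followed by a mapping-quality char
            i += 2
        elif ch in "+-":         # indel: sign, length digits, then the indel sequence
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
        elif ch == ".":          # matches the reference, forward strand
            calls.append((ref.upper(), "+"))
            i += 1
        elif ch == ",":          # matches the reference, reverse strand
            calls.append((ref.upper(), "-"))
            i += 1
        elif ch.isalpha():       # explicit base; case encodes the strand
            calls.append((ch.upper(), "+" if ch.isupper() else "-"))
            i += 1
        else:                    # '$' (read end), '*', '<', '>' carry no base call here
            i += 1
    return calls


def parse_pileup_line(line):
    """Return (chrom, position, decoded calls) for one pileup line."""
    fields = line.split()
    chrom, pos, ref, bases = fields[0], int(fields[1]), fields[2], fields[4]
    return chrom, pos, decode_bases(ref, bases)


ncbi_line = "MT_H37RV_BRD_V5 750 C   1   ^~, #"
galaxy_line = "MT_H37RV_BRD_V5 750 N   1   ^~c #   ~"
print(parse_pileup_line(ncbi_line))    # ('MT_H37RV_BRD_V5', 750, [('C', '-')])
print(parse_pileup_line(galaxy_line))  # ('MT_H37RV_BRD_V5', 750, [('C', '-')])
```

Both lines decode to a single reverse-strand C at position 750, so in this excerpt the read data agree and only the reference column differs.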

Thanks, Jen, Galaxy team
