Partial alignment statistics

4.3 years ago by

raitis • 20

Latvia

raitis • 20 wrote:

Hello,

I started using Galaxy some two weeks ago. I am not a bioinformatician by training, but a self-taught molecular biologist working with HTS. My data come from semiconductor sequencing (PGM and Proton).

I managed to get the data into Galaxy, complete the alignment, and get some QC and summary statistics.

My current problem is slicing the aligned human genome (low coverage) AND getting summary statistics for EACH slice (coverage, GC content, number of aligned reads). I managed to get a sliced BAM with BEDTools, but getting statistics for EACH slice is going poorly.

I also found this advice from Jennifer Hillman Jackson:

"Hello Els, Have you seen the tool "BEDTools -> Create a BedGraph of genome coverage"? This would give you the coverage numbers, then you could perform statistics on those numbers. You could also "Convert from BAM to BED" (there is an option to split for spliced alignments) and if you had a bed file of transcripts, use tools in this group or tools in "Operate on Genomic Intervals" to generate statistics. You could also create your own statistics using "Text Manipulation -> Compute" or "Join, Subtract and Group -> Group". Hopefully one of these options works out for you. Jen Galaxy team"

But I could not follow it; I probably need a more detailed explanation :(

Could anyone please explain in more detail how to get statistics for many small parts of the genome? Step by step preferred :)

Currently I have a working BED file with 1 million base slices. I would like to make slices on the order of 20-100 kb. I also have a working sliced BAM (at least I think it is working).

Thanks,

slicing coverage statistics • 1.7k views

modified 4.3 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.3 years ago by raitis • 20

4.3 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Very glad you found our forum!

As a general explanation: there are often many ways to go about generating statistics. Some of the tools mentioned in that original post are designed to perform one specific task. They can be combined to create custom analysis paths. Some allow you to enter in your own calculations. But there are others that compute pre-defined statistics: search in the tool panel for the keyword "statistics".

Those that work with sequence data (such as counting GC content, etc.) may require either fasta or fastq input (meaning: the actual base sequence string for the region). Or, you can use the coordinates to compare to pre-computed summary statistics (imported tracks) from sources like UCSC: the track "GC Percent" is an option. Data in fasta format can be converted to fastq with the tool "Combine FASTA and QUAL into FASTQ". Coordinates can also be used to extract sequence using the "Extract Genomic DNA" tool. If you are using a Custom reference genome, be sure to create/assign a "Custom build", as described here:
http://wiki.galaxyproject.org/Learn/CustomGenomes
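
To make the "actual base sequence string" point concrete, here is a minimal Python sketch of computing GC content per region from fasta text, outside of Galaxy. The function names (gc_fraction, read_fasta) are made up for illustration and are not Galaxy tools. The sketch also assumes that ambiguous bases such as N should be left out of the denominator, which may not match how a given tool or track defines GC percent.

```python
def gc_fraction(sequence):
    """Return the G+C fraction among unambiguous A/C/G/T bases.

    Assumption: bases other than A, C, G, T (for example N) are ignored.
    Returns None when the sequence has no unambiguous bases.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        return None
    return gc / total


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

With sequence extracted for each region (one fasta record per region), looping over read_fasta(open(path)) and calling gc_fraction on each sequence would give one GC value per slice.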

My recommendation is to review the tools (help is on each) and test out how they work. Be sure to save your final manipulations into a workflow for future use:
http://wiki.galaxyproject.org/Learn/AdvancedWorkflow -> See "Extract Workflow from a History"

For basic file manipulation and other introduction help, please see these resources. Many more are in our wiki under "Learn" and in "Shared Data -> Published Pages" on the public server.

Basic:
1. Galaxy 101: The first thing you need to try
  http://usegalaxy.org/u/aun1/p/galaxy101
2. Or Using Galaxy 2012, Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density
  http://usegalaxy.org/u/galaxyproject/p/using-galaxy-2012
Encouraged:
1. Intro Videos
  http://vimeo.com/channels/usegalaxy
2. Galaxy 101 NGS: Introduction to Polymorphism Detection via Variant Analysis
  http://usegalaxy.org/u/galaxyproject/p/galaxy-101-ngs-variant

Hopefully this helps to get you oriented, Jen, Galaxy team

written 4.3 years ago by Jennifer Hillman Jackson ♦ 25k

4.3 years ago by

raitis • 20

Latvia

raitis • 20 wrote:

Hi,

Thank you for the welcome and the answer.

I have read the available help documentation. A custom genome (a spliced human genome) does not suit me because I would lose alignments at the breakpoints (where a read lies on the border between splices, parts, or bins, whatever you call them). And as the parts get smaller, more information is lost.

I was hoping that SAMTools slice BAM could be the answer to my problem: the aligned BAM file is sliced by the regions provided. But as the description says, it is for making a custom BAM file that excludes unwanted regions. It could be useful someday.

I found software called Qualimap that supposedly does the job, but I could not get the result I intended.

Maybe I should ask differently: is it possible (and how) to cut out 10 000 random regions from the human genome and compute quality statistics on each of them?

modified 4.3 years ago • written 4.3 years ago by raitis • 20

4.3 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Custom genomes can be created from any fasta file, but are generally complete genomes. Alternatively, you can extract sequence from the full hg19 genome available on the Main instance (using a BED file).

To perform the operation you describe, obtain (instructions below) or create a BED file that covers the entire hg19 genome, one line per chromosome. Starting with that tabular dataset in interval/bed format, use the tool "Make windows" to split each chromosome into windows of the size you want for each region. Next, to select random regions from this dataset, use the tool "Select random lines from a file". Finally, you can extract the actual sequence using this dataset and the tool "Extract Genomic DNA" and run statistics on the sequence data, or just use the interval coordinates directly to pull out information from tracks that already summarize the features/statistics of interest.
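
For anyone who wants to see the windowing and random-selection logic spelled out, here is a rough Python sketch of the same idea outside of Galaxy. It is not the code behind "Make windows" or "Select random lines from a file"; the function names and the seed parameter are hypothetical. It assumes BED-style coordinates (0-based start, end not included) and lets the last window on each chromosome be shorter than the rest.

```python
import random


def make_windows(chrom_sizes, window_size):
    """Yield (chrom, start, end) windows covering each chromosome.

    chrom_sizes: iterable of (chrom, size) pairs.
    Coordinates are BED-style: 0-based start, end not included.
    The final window of a chromosome is truncated at the chromosome end.
    """
    for chrom, size in chrom_sizes:
        for start in range(0, size, window_size):
            yield chrom, start, min(start + window_size, size)


def sample_windows(windows, n, seed=None):
    """Return up to n windows chosen at random without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the pick repeatable
    pool = list(windows)
    picked = rng.sample(pool, min(n, len(pool)))
    return sorted(picked)  # sorted by chromosome name, then start
```

For example, sample_windows(make_windows(sizes, 50000), 10000) would pick 10 000 random 50 kb windows, where sizes is a list of (chromosome, length) pairs for the genome in question.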

UCSC (http://genome.ucsc.edu) has many tracks to compare against (not just GC percent). Browse there first to review the available tracks and descriptions, then decide which to import/use. Some may be too large to import directly for the entire genome with the "Get data -> UCSC Main" tool that accesses the UCSC Table Browser (this has a data transfer limit), but you can also use the tool "Get Data -> Upload File" to import data from their downloads server using the URL of the table/track (no need to actually download the file, then upload it via FTP).

For the starting full-coverage chromosome dataset for any genome at UCSC, you can import the "chromInfo" table from the UCSC Table Browser. Once in Galaxy, use the tool "Cut" to create a properly formatted BED file. Interval format is often fine with tools, but not with "Extract Genomic DNA": this requires strict BED format, even if the dataset is labeled as interval. Find this table by navigating to the genome of interest (for example, hg19), select group = all tables, then the table "chromInfo" from the list. These are small and can be imported through the "UCSC Main" tool.
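
As an illustration of what the reformatting step needs to produce, here is a small Python sketch that turns chromInfo rows into three-column BED lines (chromosome, start 0, end = chromosome length). It assumes the table is tab-separated with the chromosome name in the first column and its size in the second, and that any header line starts with "#"; check the imported dataset before relying on that layout. The function name is made up for the example.

```python
def chrominfo_to_bed(lines):
    """Convert chromInfo rows to BED3 lines, one per chromosome.

    Assumed input layout: tab-separated, column 1 = chromosome name,
    column 2 = chromosome size; extra columns are ignored.
    Lines that are empty or start with '#' are skipped.
    """
    bed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, size = fields[0], int(fields[1])
        # BED starts are 0-based, so a whole chromosome is 0..size.
        bed.append(f"{chrom}\t0\t{size}")
    return bed
```

The resulting lines are one whole-chromosome interval each, which is the kind of dataset the windowing step starts from.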

File format help is here: http://wiki.galaxyproject.org/Learn/Datatypes

Best, Jen, Galaxy team

modified 4.3 years ago • written 4.3 years ago by Jennifer Hillman Jackson ♦ 25k
