HISAT2 and custom reference genome

Question: HISAT2 and custom reference genome

15 months ago by

Barcelona/PRBB

Joan Gibert • 20 wrote:

Hi!

I want to perform HISAT2 aligment in galaxy using a custom genome. In this case I want to use 3'UTR genome from hg38. I obtain it from Table Browser by generating a custom track with 3'UTR of the hg38 and sending the custom track to Galaxy in fasta format.

When I run HISAT2 in Galaxy, it didn't report any error but, when I check the output file rather than appear the summary of the aligments appears the following:

Building DifferenceCoverSample 

Building sPrime 

Building sPrimeOrder 

V-Sorting samples 

V-Sorting samples time: 00:00:03 

Allocating rank array Ranking v-sort output 

Ranking v-sort output time: 00:00:01 

Invoking Larsson-Sadakane on ranks I

Then, if I check the aligment in Genome Browser it is completely empty. I guess that the problem is with the reference genome that i'm using. Any advice?

Thanks! Joan

P.S.: Forgot to mention that I tried to follow the protocol to build a custom genome from here: https://galaxyproject.org/learn/custom-genomes/

The only thing that I'm not sure how to do is the point about the chromosome identifiers ("Make sure the chromosome identifiers are a match for other inputs")

rna-seq • 920 views

ADD COMMENT • link •

modified 15 months ago • written 15 months ago by Joan Gibert • 20

15 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

I do not see any HISAT2 alignment results that were run against the 3' UTR genome extracted from UCSC (although I see those fasta datasets in one of your histories). Was the job run somewhere else? A first guess is that the tool ran out of memory while executing the job. The custom genome is quite large. Ask the administrator where you are working if more resources can be allocated to the tool. This job would fail at http://usegalaxy.org for memory resources and the error message would state that (it is how our cluster is configured, but the configuration can vary between Galaxy servers).

In the context of a custom genome, the "chromosome" identifiers are the identifiers in the fasta dataset. A "custom genome" can also be a "custom transcriptome" or really any other fasta dataset that fits the analysis. The identifiers for all are the content of the ">" title line from the start until the first whitespace. Any title line content after the first whitespace is considered description content. In nearly all cases, it is important to retain only the identifiers and get rid of the description content (NormalizeFasta can be used, as described in the FAQ). This allows the data to be matched up with other inputs - for example, a reference GTF file. The identifiers in the GTF (the first column) should be an exact match for the identifiers in the custom genome or there will be mismatch problems and tools will fail or produce odd output.

I see that you have run some mapping jobs against a full native index that has reference annotation details (BED, GTF) available from UCSC and other sources like iGenomes. You can infer if hits to the genome overlap with 3' UTR regions with that mapping. Tools in the groups Operate on Genomic Intervals, BAMTools, and many others have methods for directly finding overlap between hits and reference annotation. Which to use depends on what kind of output you want and what downstream tools you intend to use.

Galaxy analysis tutorials are here for reference: https://galaxyproject.org/learn/

Thanks, Jen, Galaxy team

ADD COMMENT • link written 15 months ago by Jennifer Hillman Jackson ♦ 25k

15 months ago by

Joan Gibert • 20

Barcelona/PRBB

Joan Gibert • 20 wrote:

Hello Jen!

I performed a HISAT aligment with a 3'UTR genome (using NormalizeFasta output) in the "main" Galaxy server, for you to check. It appeared again the code that I wrote in the previous post, but it does not seem an error to me. I guess that it is a kind of error...

Anyway, my idea after the aligment is to use featureCounts (or HTseq) to count only the reads that are in the 3'UTR and not in the CDS. The protein that I work with is an RNA bindnig protein that recognizes a concrete sequence in the 3'UTR or the mRNA. The experiment is an IP for the specific protein in a crosslinked context, where the RNA and the protein are IPed together. In the methodology we always have reads in de CDS (is a matter of the crosslinking, not a direct binding) and I want to discern if I could get rid of the reads in the CDS, the targets that I will obtain may be "cleaner" than with the whole genome.

So, with this, what do you think I should do? Do you think that I could keep only the 3'UTR reads after the mapping with some of the tools that are in the Operate on Genomic Intervals or BAMTools? Any hint? :) If so, I will take a look at the tutorials in order to see how can I perform this.

Thanks! Joan

ADD COMMENT • link written 15 months ago by Joan Gibert • 20

FeatureCounts can be used with an input reference dataset that contains just the 3' UTRs. Use the output option for "default" (3rd option listed). This will count up reads for just those regions in the annotation. If that is your goal, then it is done.

If you want to find out which sequences exactly where hitting to these regions (or not), try the tool NGS: Peak Calling > BAM filter. This can filter the BAM results to include/exclude hits that overlap with a given BED dataset. This can be used in downstream analysis as-is, or from there you can extract the fastq sequences from the filtered BAM file with NGS: Picard > SamToFastq (this tool accepts a sorted BAM file as input).

There are a few ways to do this, but the above is the most direct method I can think of with your given goals.

ADD REPLY • link written 15 months ago by Jennifer Hillman Jackson ♦ 25k

Update - I just checked your HISAT2 mapping results against the custom 3' UTR genome. It looks to be successful, so that is good. The trouble going forward is that both Featurecounts and HTseq count will need a reference annotation dataset to count up reads per region, and you won't have one based on the custom genome used (this is where "identifiers" come into play - the reference annotation and the mapping results must have the same "identifiers" for where the hits are located). However, this mapping does isolate which sequences hit the target regions.

ADD REPLY • link written 15 months ago by Jennifer Hillman Jackson ♦ 25k

Thanks for the information! However, I have couple of issues...

First, you comment that HISAT2 mapping worked but when I tried to check the reads in UCSC browser, nothing appears. Is there any manipulation that I should perform to the sample to see it in the browser? I will fight a bit with the identifiers issue and see how I can manage to perform the quantification.

Second, I tried to see if I could perform fC with the hg38 whole genome bam file. It seems that worked using a GTF file of the 3'UTR. Should I perform any formatting of this file? If I change the gene_id and keep only the NM there would e more than one NM per gene, right?

Thanks!

ADD REPLY • link written 15 months ago by Joan Gibert • 20

Please log in to add an answer.

Similar posts • Search »