TopHat RNA-Seq FASTQ file not seen

Question: TopHat RNA-Seq FASTQ file not seen

2.2 years ago by

r.naylor • 0 wrote:

Hey, I'm trying to use Galaxy to process my RNA-seq data.

I have uploaded the FASTQ file to Galaxy via FTP, and all seems to have worked well (the history background has gone green). But when I come to do the RNA analysis with TopHat, I cannot input anything into the "RNA-seq FASTQ file" box. How do I do this?

Thanks

fastq tophat datatype input fastqsanger • 794 views


modified 2.2 years ago • written 2.2 years ago by r.naylor • 0

2.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Double check that the input datatype matches what the tool is expecting. In this case, fastqsanger is the expected datatype.

How tools recognize datasets: https://wiki.galaxyproject.org/Support#Tool_doesn.27t_recognize_dataset

Confirming and assigning fastqsanger format: https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
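
For anyone who wants to sanity-check a FASTQ file locally before assigning fastqsanger, here is a rough sketch, not a Galaxy tool. It guesses the quality-score offset from the ASCII range of the quality lines. The function name is made up for illustration, it assumes plain four-line FASTQ records (no wrapped sequences), and the thresholds are the usual rule of thumb rather than a guarantee.

```python
def guess_fastq_offset(handle, max_records=10000):
    """Guess the Phred offset from the quality lines of a FASTQ stream.

    Returns 33 (Sanger-style, what fastqsanger expects), 64 (older
    Illumina/Solexa-style encodings), or None if the sample is ambiguous.
    Assumes four lines per record.
    """
    lowest, highest = 255, 0
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 3:  # every fourth line holds the quality string
            codes = [ord(c) for c in line.rstrip("\r\n")]
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if lowest < 59:
        # Characters below ';' do not occur in the +64 encodings.
        return 33
    if highest > 74:
        # Characters above 'J' are unusual for +33 data.
        return 64
    return None
```

Typical use would be `with open("reads.fastq") as fh: print(guess_fastq_offset(fh))`, where the file name is a placeholder. A result of 33 is consistent with fastqsanger; 64 suggests the data needs converting first; None means the sample did not contain enough extreme values to tell.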

Thanks, Jen, Galaxy team

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

2.2 years ago by

r.naylor • 0 wrote:

Thank you Jen,

I managed to get the file to work. The format was incorrect, as you say (though when I changed the datatype to fastqsanger it did not work; only when I changed it to gtf did it work). My problem now is that I am still getting CUFF identifiers for my genes and transcripts, whereas I was hoping the gene names would be used.

My annotated genome was taken from Ensembl, Danio_rerio.GRCz10.85.gtf.

Is there something that you can see I'm doing wrong here?

Cheers Richard

written 2.2 years ago by r.naylor • 0

If this was a reference annotation file, not a FASTQ file as originally stated in the question, then yes, one of the expected input formats is GTF. That datatype must be assigned for the tool to use it. The expected format for each input is noted on the tool form.

For your other issue, one of two things could be the source of the problem:

The chromosome identifiers in the reference annotation do not match the chromosome identifiers in the reference genome used to map the FASTQ data. Help to detect and correct genome mismatch problems: https://wiki.galaxyproject.org/Support#Reference_genomes
If that checks out, then the annotation is probably missing the key attributes that some RNA-seq tools (such as Cufflinks, Cuffmerge, Cuffdiff) use to generate full statistics and annotation. Look for tss_id, p_id, and gene_name in the attributes. The gene_id and transcript_id also need to be different values. Many sources are not formatted this way or are missing the attributes. If your genome is available from iGenomes, that is one appropriate source of compatible annotation: http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/#cuffdiff-input-files.
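
The attribute check in the second point can also be done outside Galaxy. The following is only a sketch under a few assumptions: the GTF is tab-separated with the attributes in the ninth column, attribute values are double-quoted, and the function names are invented for this example. It counts how many records lack gene_name, tss_id, or p_id, and how many have identical gene_id and transcript_id values.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')
WANTED = ("gene_name", "tss_id", "p_id")


def parse_attributes(field):
    """Turn the GTF attributes column into a dict of quoted key/value pairs."""
    return dict(ATTR_RE.findall(field))


def summarize_gtf(lines):
    """Count records missing each wanted attribute, and records where
    gene_id and transcript_id are the same value."""
    total = 0
    missing = {key: 0 for key in WANTED}
    same_ids = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        total += 1
        attrs = parse_attributes(cols[8])
        for key in WANTED:
            if key not in attrs:
                missing[key] += 1
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id == attrs.get("transcript_id"):
            same_ids += 1
    return {
        "records": total,
        "missing": missing,
        "gene_id_equals_transcript_id": same_ids,
    }
```

Running `summarize_gtf(open("genes.gtf"))` (file name is a placeholder) and seeing large counts under "missing" would point to the annotation content, rather than the mapping, as the source of the problem.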

Using iGenomes might solve the genome mismatch problem as well, if they support the target genome. Compare the short database name in Galaxy with the build name listed at iGenomes to see whether they match.

How to get iGenomes annotation data: Download the tar file locally to your computer, unpack it, then upload just the genes.gtf file to Galaxy. As a double check, compare the chromosome identifiers in the iGenomes file to those in your mapped BAM/SAM data (if in BAM format, use BAM-to-SAM to extract just the header and compare in that format).
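
If the SAM header and the GTF are downloaded as text files, the identifier comparison described above could be sketched like this. It is a minimal illustration, not part of Galaxy: the function names are invented, the SAM input is assumed to be text with the reference names on the @SQ header lines (SN: field), and the GTF chromosome is assumed to be the first tab-separated column.

```python
def sam_header_chroms(lines):
    """Collect reference sequence names from the @SQ lines of a SAM header."""
    names = set()
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if line.startswith("@SQ"):
            for field in line.rstrip("\r\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gtf_chroms(lines):
    """Collect the distinct values of the first column of a GTF file."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare_chroms(sam_lines, gtf_lines):
    """Report identifiers shared by both files and those unique to each."""
    sam = sam_header_chroms(sam_lines)
    gtf = gtf_chroms(gtf_lines)
    return {
        "shared": sorted(sam & gtf),
        "only_in_sam": sorted(sam - gtf),
        "only_in_gtf": sorted(gtf - sam),
    }
```

An empty "shared" list (for example "chr1" in one file and "1" in the other) is the mismatch case: the annotation and the mapped data do not use the same identifiers.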

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

2.2 years ago by

r.naylor • 0 wrote:

Hi Jen,

Thank you for your help. I uploaded the genes.gtf from the iGenomes zebrafish tar file, but I am still not getting any gene names in the lists that are generated.

Do you think it could be the chromosome identifiers you suggested I double check? (I do not know how to do this.)

Thanks Richard

written 2.2 years ago by r.naylor • 0

How to check reference genomes is covered in one of the links I sent; here it is again: https://wiki.galaxyproject.org/Support#Reference_genomes

This is important to learn how to do, because this type of mismatch can come up in any analysis. If there is a mismatch, the results will be scientifically incorrect.

For a quick check, since I know what data you are working with, do this:

If running TopHat2 at http://usegalaxy.org, the only build of this genome available would be danRer10. Check whether this is the build that was actually used. The output BAM file will have danRer10 assigned as the "database" by default. If it is something else, then first map against danRer10.

The genes.gtf file that you retrieved should also be based on danRer10. The build name is in the link you clicked to download the tar file and in the name of the tar file itself.

If either of these is not a match, correct and re-run, potentially from the mapping step. All data must be created with respect to the same exact genome build.

Once all this is done, if some or all transcripts still have no gene names, then your reads (and the transcripts created from them) are not mapping to the same genome locations as the transcripts/genes annotated with gene_name in the annotation GTF file. I haven't downloaded and looked at this GTF specifically, but you can search through it yourself. In the last column of the GTF file (attributes), there should be a field like "gene_name" in order for gene names to be assigned. Test: the mapped reads BAM, the GTF created by Cufflinks, and the reference iGenomes GTF can be loaded into Trackster and visualized for overlap.
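
As a non-visual complement to the Trackster check, the overlap can be counted with a short script. This is a sketch under stated assumptions, not how Cufflinks assigns names: it assumes the assembled GTF has rows whose third column is "transcript", it ignores strand, it treats any reference record carrying a gene_name attribute as a named interval, and the function names are made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def records(lines):
    """Yield (chrom, feature, start, end, attributes) for each GTF record."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) >= 9:
            yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[8]


def build_index(ref_lines):
    """Per chromosome: sorted interval starts plus the running maximum end,
    using only reference records that carry a gene_name attribute."""
    spans = defaultdict(list)
    for chrom, _feature, start, end, attrs in records(ref_lines):
        if 'gene_name "' in attrs:
            spans[chrom].append((start, end))
    index = {}
    for chrom, intervals in spans.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_end, running = [], 0
        for _, e in intervals:
            running = max(running, e)
            max_end.append(running)
        index[chrom] = (starts, max_end)
    return index


def overlaps(index, chrom, start, end):
    """True if [start, end] overlaps any named reference interval."""
    if chrom not in index:
        return False
    starts, max_end = index[chrom]
    i = bisect_right(starts, end)  # intervals that start at or before `end`
    return i > 0 and max_end[i - 1] >= start


def count_overlaps(assembled_lines, ref_lines):
    """Return (overlapping, non_overlapping) counts of assembled transcripts."""
    index = build_index(ref_lines)
    hit = miss = 0
    for chrom, feature, start, end, _ in records(assembled_lines):
        if feature != "transcript":
            continue
        if overlaps(index, chrom, start, end):
            hit += 1
        else:
            miss += 1
    return hit, miss
```

If nearly every assembled transcript lands in the non-overlapping count, that points back to a build or chromosome identifier mismatch; if most overlap but names are still absent, the attributes in the annotation are the more likely cause.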

modified 2.2 years ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k
