Question: How to create a GFF file for HTSeq
1
22 months ago by
shraddha.adamane10 wrote:

Hello there, I am working with some RNA-seq data and was trying to run the raw data, in BAM format, through HTSeq. It requires a GFF file, so I wanted to know: What is a GFF file, and why do I need to create one? How am I supposed to create it? How can I tell whether I have created and run it correctly?

It will be really great if you could let me know about it.

Thanks, Shradha

rna-seq bam • 1.3k views
ADD COMMENTlink modified 5 months ago by Joey Zou0 • written 22 months ago by shraddha.adamane10
1

The appropriate GTF file to use depends on the genome used to create the BAM file. What species did you align against and what version of the genome?

ADD REPLYlink written 22 months ago by Devon Ryan1.9k
1
21 months ago by
fate.gh10 wrote:

I'm no expert, but as far as I know from the HTSeq 0.6.1p2 documentation: in GTF files generated by the Table Browser function of the UCSC Genome Browser, the gene_id attribute incorrectly contains the same value as the transcript_id attribute and hence a different value for each transcript of the same gene. Hence, if a read maps to an exon shared by several transcripts of the same gene, this will appear to htseq-count as an overlap with several genes. Therefore, these GTF files cannot be used as is. Either correct the incorrect gene_id attributes with a suitable script, or use a GTF file from a different source.

For the reason above, a GTF file from UCSC may not be appropriate.

A GTF file from GENCODE may work better.
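
To check whether a given GTF file has the problem described above, something like the following might help. It is only a minimal sketch: the function names are made up for illustration, and it assumes a tab-separated GTF whose ninth column holds attributes written as key "value"; pairs.

```python
import re

# Matches attribute pairs such as: gene_id "uc001aaa.3";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn 'gene_id "x"; transcript_id "y";' into a dict."""
    return dict(ATTR_RE.findall(attr_field))


def gene_id_mirrors_transcript_id(gtf_lines):
    """Return True if every usable record has gene_id == transcript_id.

    Returns False as soon as one record differs, or if no record
    carries both attributes.
    """
    seen_any = False
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        seen_any = True
        if attrs["gene_id"] != attrs["transcript_id"]:
            return False
    return seen_any
```

Calling it as gene_id_mirrors_transcript_id(open("annotation.gtf")), where annotation.gtf is a placeholder file name, and getting True back would suggest the file has the issue the documentation warns about.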

ADD COMMENTlink modified 21 months ago • written 21 months ago by fate.gh10
0
22 months ago by
shraddha.adamane10 wrote:

Hi Devon Ryan, I guess it has been aligned to hg19 (Human Genome version 19).

ADD COMMENTlink written 22 months ago by shraddha.adamane10

If UCSC chromosome names were used (e.g., "chr1" and "chr2"), then get your GTF file from UCSC via the table browser (unless this is available under "shared data" in Galaxy, in which case just use it). If you instead used Ensembl chromosome names (e.g., 1 and 2), you can download a GTF file from Ensembl.
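
One way to see which naming scheme applies is to compare the sequence names in the GTF file with those in the BAM header. The sketch below assumes you have the header as text lines (for example from samtools view -H) and a tab-separated GTF; the function names are hypothetical.

```python
def gtf_sequence_names(gtf_lines):
    """Collect the distinct values of column 1 (seqname) from GTF/GFF lines."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def sam_reference_names(header_lines):
    """Collect the SN: values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\r\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def shared_names(gtf_lines, header_lines):
    """Names present in both the annotation and the alignment header."""
    return gtf_sequence_names(gtf_lines) & sam_reference_names(header_lines)
```

If shared_names returns an empty set, the annotation and the alignments most likely use different naming schemes (e.g., "chr1" versus 1), and no read will be assigned to a feature.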

ADD REPLYlink written 22 months ago by Devon Ryan1.9k
0
22 months ago by
shraddha.adamane10 wrote:

Hello Devon Ryan. I now have the GFF file, but I can't work out how to upload it from my computer into the RNAanalysis--> HTseq--> GFF file folder. It would be great if you could tell me how. Thanks

ADD COMMENTlink written 22 months ago by shraddha.adamane10

Upload it to Galaxy so it's in the same history. Then it'll be selectable.

ADD REPLYlink written 22 months ago by Devon Ryan1.9k

Hi Devon Ryan. I was finally able to upload the GTF file, but I am getting the following error:

"Fatal error: Unknown error occured 0 GFF lines processed. Warning: No features of type 'exon' found. Error occured when processing SAM input (record #583 in file /galaxy-repl/main/files/016/925/dataset_16925703.dat): tid -1 out of range 0<=tid<25 [Ex"

It would be great if you could help explain the error and how to fix it.

ADD REPLYlink written 22 months ago by shraddha.adamane10

Can you post a link to where you got the GFF/GTF file?

ADD REPLYlink written 22 months ago by Devon Ryan1.9k

http://genome.ucsc.edu/cgi-bin/hgTables

ADD REPLYlink written 22 months ago by shraddha.adamane10

Right, could you post the first 10-20 lines then?

ADD REPLYlink written 22 months ago by Devon Ryan1.9k

chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 13403 14409 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 14362 14829 0.000000 - . gene_id "uc009vis.3"; transcript_id "uc009vis.3";
chr1 hg19_knownGene exon 14970 15038 0.000000 - . gene_id "uc009vis.3"; transcript_id "uc009vis.3";
chr1 hg19_knownGene exon 15796 15942 0.000000 - . gene_id "uc009vis.3"; transcript_id "uc009vis.3";
chr1 hg19_knownGene exon 16607 16765 0.000000 - . gene_id "uc009vis.3"; transcript_id "uc009vis.3";
chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjc.1"; transcript_id "uc009vjc.1";
chr1 hg19_knownGene exon 17233 17751 0.000000 - . gene_id "uc009vjc.1"; transcript_id "uc009vjc.1";
chr1 hg19_knownGene exon 15796 15947 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
chr1 hg19_knownGene exon 16607 16765 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
chr1 hg19_knownGene exon 17233 17368 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
chr1 hg19_knownGene exon 17606 18061 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
chr1 hg19_knownGene exon 14362 14829 0.000000 - . gene_id "uc009vit.3"; transcript_id "uc009vit.3";
chr1 hg19_knownGene exon 14970 15038 0.000000 - . gene_id "uc009vit.3"; transcript_id "uc009vit.3";

ADD REPLYlink written 22 months ago by shraddha.adamane10

Odd, your file obviously has exon entries. You might submit a bug report so the admin of the server can directly have a look. This will take him/her all of a couple minutes to debug, since the files are immediately at hand.

ADD REPLYlink written 22 months ago by Devon Ryan1.9k

Hi Devon Ryan, if I understand correctly, when you asked me to post 10-20 lines, you meant lines from the GTF file and not from my study file, right? Just thought I'd check with you.

ADD REPLYlink written 22 months ago by shraddha.adamane140

Right, you uploaded exactly what I wanted to see. That's actually the weird thing about it, since what you have should work perfectly. My only guess is that there's something like having spaces instead of tabs or some other weird sort of thing that's mucking everything up. The server admin might need to play around with the file a bit to see what the issue is.
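
The spaces-versus-tabs guess can be tested before involving the admin. GTF and GFF records are expected to have nine tab-separated columns, so a rough check (a sketch with a hypothetical function name, not something from HTSeq itself) could look like this:

```python
def malformed_records(lines, expected_fields=9):
    """Yield (line_number, field_count) for every record that does not
    split into exactly `expected_fields` tab-separated columns."""
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments are not records
        count = len(line.rstrip("\r\n").split("\t"))
        if count != expected_fields:
            yield number, count
```

A line that uses spaces instead of tabs shows up with a field count of 1, since it contains no tab at all. Running list(malformed_records(open("annotation.gtf"))), with annotation.gtf standing in for the real file name, and getting an empty list back would rule this explanation out.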

ADD REPLYlink written 22 months ago by Devon Ryan1.9k
0
22 months ago by
shraddha.adamane10 wrote:

Hi Devon Ryan, I tried uploading both the Ensembl and the UCSC GTF files. The errors I got are as follows. Error with the Ensembl GTF file: error An error occurred with this dataset: Fatal error: Unknown error occured 0 GFF lines processed. Warning: No features of type 'exon' found. Error occured when processing SAM input (record #583 in file /galaxy-repl/main/files/016/925/dataset_16925703.dat): tid -1 out of range 0<=tid<25 [Ex

The error with UCSC GTF files is: error An error occurred with this dataset: Fatal error: Unknown error occured 100000 GFF lines processed. 200000 GFF lines processed. 300000 GFF lines processed. 400000 GFF lines processed. 500000 GFF lines processed. 600000 GFF lines processed. 700000 GFF lines processed. 800000 GFF lines proces

I have reported the error with the UCSC GTF file to the admin. Would you suggest anything else, or should I wait for the admin's reply?

Thanks

ADD COMMENTlink written 22 months ago by shraddha.adamane10

You'll have to wait for the admin to respond.

ADD REPLYlink written 22 months ago by Devon Ryan1.9k

Okk.. I will wait.. Thanks ..

ADD REPLYlink written 22 months ago by shraddha.adamane10
0
5 months ago by
Joey Zou0 wrote:

Hello, Devon Ryan. When I ran htseq with my data, I got the following error:

Warning: No features of type 'exon' found.
Warning: Read E00488:175:H3CYCCCXY:3:1101:1610:63173 claims to have an aligned mate which could not be found in an adjacent line.
Warning: 2235041 reads with missing mate encountered.

Could you please help me solve this problem? I don't know whether the GFF file is broken or something else is wrong. Do I need to convert the GFF file to a GTF file? I would really appreciate your help. My GFF file looks like this:

scaffold1002    GLEAN   mRNA    20257   25173   0.970574    +   .   ID=DUH000001.2;source_id=RHOqdgD_GLEAN_10019967;
scaffold1002    GLEAN   CDS 20257   20352   .   +   0   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 20458   20579   .   +   0   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 20917   21050   .   +   1   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 21578   21728   .   +   2   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 22844   23115   .   +   1   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 24360   24475   .   +   2   Parent=DUH000001.2;
scaffold1002    GLEAN   CDS 24871   25173   .   +   0   Parent=DUH000001.2;
scaffold1002    Cuff    mRNA    29262   31510   1000    +   .   ID=DUH000002.1;source_id=CUFF1.6.1;
scaffold1002    Cuff    UTR_5   29262   29749   1000    +   .   Parent=DUH000002.1;support_id=CUFF1.6.1;
scaffold1002    Cuff    CDS 29750   31108   1000    +   0   Parent=DUH000002.1;
scaffold1002    Cuff    UTR_3   31109   31510   1000    +   .   Parent=DUH000002.1;support_id=CUFF1.6.1;
scaffold1002    Cuff    mRNA    78137   83175   1000    +   .   ID=DUH000003.1;source_id=CUFF1.7.1;
scaffold1002    Cuff    UTR_5   81341   81400   1000    +   .   Parent=DUH000003.1;support_id=CUFF1.7.1;
scaffold1002    Cuff    CDS 81401   81403   1000    +   0   Parent=DUH000003.1;
scaffold1002    Cuff    CDS 82430   82660   1000    +   0   Parent=DUH000003.1;
scaffold1002    Cuff    UTR_3   82661   83175   1000    +   .   Parent=DUH000003.1;support_id=CUFF1.7.1;
ADD COMMENTlink written 5 months ago by Joey Zou0

Please post new questions as new posts in the future and not as answers to old posts.

As the first warning indicates, you don't have any exons in your annotation. I have no idea how that happened, but you'll need to figure it out.
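
To see which feature types an annotation actually contains, you could tally the third column. This is a minimal sketch that assumes a tab-separated GFF/GTF file; the function name is made up for illustration.

```python
from collections import Counter


def feature_type_counts(lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

For the excerpt posted above, this would report mRNA, CDS, UTR_5 and UTR_3 but no exon, which matches the warning. htseq-count takes the feature type from its --type option (default: exon) and the grouping attribute from --idattr (default: gene_id), so counting on a type and attribute that the file really contains may be one way forward, though whether that suits this particular annotation is an assumption you would need to verify.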

ADD REPLYlink written 5 months ago by Devon Ryan1.9k
0
5 months ago by
Joey Zou0 wrote:

Hi, Devon Ryan. Thanks for your help. Please excuse me. I'll follow the rules of the Biostar community. Thanks a lot.

ADD COMMENTlink written 5 months ago by Joey Zou0