Question: Impossible to use Htseq-count on BAM files from Tophat2
0
gravatar for thomas.enriquez
15 months ago by
thomas.enriquez0 wrote:

Hello,

I'm currently facing troubles using galaxy. I want to compare differentially expressed genes between two treatment groups. I already map my reads on my reference genome (70% remaping) and now I'm trying to obtain the differential expression matrix using Htseq count. (For information, my data are Illumina Hiseq 2500, pair end, 125pb).

I already map my reads on my reference genome thanks to Tophat2 (70%remaping), but when I tried to run Htseq on the Bam files from Htseq send me this error message:

Fatal error: Unknown error occured Error occured when processing GFF file (line 40 of file /opt/galaxy-dist/database/files/002/052/dataset_2052791.dat): Feature DS10_00012179-RA:exon:1059 does not contain a 'gene_id' attribute [Exception type: ValueError, raised in count.py:53]**

I though that maybe it could an issue due to my gff3 file, and I tried to convert it into a gtf file using the GFF to GTF converter. But I obtain the following error message:

Traceback (most recent call last): File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/gff_to_gtf.py", line 17, in <module> import GFFParser File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/GFFParser.py", line 20, in <module> import scipy.io as sio ImportError: No module named scipy.io

I read that it could be because my Bam files were not sorted by the gene id. So, I tried to sort my Bam files using the tool sort from the SAMtool suite, and obtain an error message again:

Tool execution generated the following error message: Error running samtools sort. mv: cannot stat `foo.bam': No such file or directory The tool produced the following additional output: [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -n Sort by read name -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam -@, --threads INT Set number of sorting and compression threads [1] --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null]

I do not understand why I received as much error messages. Does anyone face up a similar issue? Or knows where this problems come from?

Thank you in advance,

Thomas

rna-seq software error • 936 views
ADD COMMENTlink modified 14 months ago • written 15 months ago by thomas.enriquez0
0
gravatar for julie dubois
15 months ago by
julie dubois20
julie dubois20 wrote:

Dear Thomas,

The first error indicates that you have a problem on your gff3 file on line 40, htseq don't retrieve the gene_id and I suppose you want to count according gene_id. How is this line and is it different from other ? Did you correctly indicate the encoding of gene_id in the htseq formular regarding the content of your gff3 features file ?

Julie

ADD COMMENTlink written 15 months ago by julie dubois20

Dear Julie, thank you for responding. Yes, I want to count according gene_id. I'm using galaxy, and in Htseq you can specify the ID attribute, and I specified gene_id.

Here is a link to a txt file with the line 40 of the gff3 file: https://www.dropbox.com/s/3b3gih4iqlzds43/exemple%20gff3.txt?dl=0

The line 40 is the first which the type is "gene", and the first where the attribute is not "ID=scaffold"

I saw that in my gff3 file some lines where separated from others by 3# ("###") as you can see on the txt I send you: I was wondering if is it normal?

Thank you for your help, Thomas

ADD REPLYlink written 14 months ago by thomas.enriquez0
0
gravatar for julie dubois
14 months ago by
julie dubois20
julie dubois20 wrote:

I'm not sure the "###" is normal but I think it's not the biggest problem for htseq. When you say "gene_id" in the option ID attribute, you say to htseq to search exactly this word (gene_id) in one of the column of gff3 file. So in your file there is no "gene_id" word. I don't know what is your gff but if you dowlnolad a gtf from ensembl you can see that there is names gene_id, transcript_id ... in one of column. An you can choose to count with htseq according "gene_id" or according "transcript_id" ...

Here for your file, may be you want to count according "ID" but this choice will not only count on gene but equally on match_part, contig ..., because all of this item have an "ID" in the next column. Other option can be to count according : "Parent" or "Target".

It's according your question.

Julie

ADD COMMENTlink written 14 months ago by julie dubois20
0
gravatar for thomas.enriquez
14 months ago by
thomas.enriquez0 wrote:

Hello , Sorry for the time withtout answers. So i set up Htseq count with the term "transcript_id" and I convert my gff3 into gtf using gffread. Htseq has finished his job without any error report, but the output matrix contains only 0 : Geneid TopHat on data 69 data 16 and data 15: accepted_hits DS10_00000001 0 DS10_00000002 0 DS10_00000003 0 DS10_00000004 0 DS10_00000005 0 DS10_00000006 0 DS10_00000007 0 DS10_00000008 0 DS10_00000009 0

Did you face up a similar problem? Thanks a lot for your help, Thomas

ADD COMMENTlink written 14 months ago by thomas.enriquez0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 171 users visited in the last hour