Impossible to use Htseq-count on BAM files from Tophat2

Question: Impossible to use Htseq-count on BAM files from Tophat2

15 months ago by

Hello,

I'm currently facing troubles using galaxy. I want to compare differentially expressed genes between two treatment groups. I already map my reads on my reference genome (70% remaping) and now I'm trying to obtain the differential expression matrix using Htseq count. (For information, my data are Illumina Hiseq 2500, pair end, 125pb).

I already map my reads on my reference genome thanks to Tophat2 (70%remaping), but when I tried to run Htseq on the Bam files from Htseq send me this error message:

Fatal error: Unknown error occured Error occured when processing GFF file (line 40 of file /opt/galaxy-dist/database/files/002/052/dataset_2052791.dat): Feature DS10_00012179-RA:exon:1059 does not contain a 'gene_id' attribute [Exception type: ValueError, raised in count.py:53]**

I though that maybe it could an issue due to my gff3 file, and I tried to convert it into a gtf file using the GFF to GTF converter. But I obtain the following error message:

Traceback (most recent call last): File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/gff_to_gtf.py", line 17, in <module> import GFFParser File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/GFFParser.py", line 20, in <module> import scipy.io as sio ImportError: No module named scipy.io

I read that it could be because my Bam files were not sorted by the gene id. So, I tried to sort my Bam files using the tool sort from the SAMtool suite, and obtain an error message again:

Tool execution generated the following error message: Error running samtools sort. mv: cannot stat `foo.bam': No such file or directory The tool produced the following additional output: [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -n Sort by read name -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam -@, --threads INT Set number of sorting and compression threads [1] --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null]

I do not understand why I received as much error messages. Does anyone face up a similar issue? Or knows where this problems come from?

Thank you in advance,

Thomas

rna-seq software error • 936 views

ADD COMMENT • link •

modified 14 months ago • written 15 months ago by thomas.enriquez • 0

15 months ago by

julie dubois • 20

julie dubois • 20 wrote:

Dear Thomas,

The first error indicates that you have a problem on your gff3 file on line 40, htseq don't retrieve the gene_id and I suppose you want to count according gene_id. How is this line and is it different from other ? Did you correctly indicate the encoding of gene_id in the htseq formular regarding the content of your gff3 features file ?

Julie

ADD COMMENT • link written 15 months ago by julie dubois • 20

Dear Julie, thank you for responding. Yes, I want to count according gene_id. I'm using galaxy, and in Htseq you can specify the ID attribute, and I specified gene_id.

Here is a link to a txt file with the line 40 of the gff3 file: https://www.dropbox.com/s/3b3gih4iqlzds43/exemple%20gff3.txt?dl=0

The line 40 is the first which the type is "gene", and the first where the attribute is not "ID=scaffold"

I saw that in my gff3 file some lines where separated from others by 3# ("###") as you can see on the txt I send you: I was wondering if is it normal?

Thank you for your help, Thomas

ADD REPLY • link written 14 months ago by thomas.enriquez • 0

14 months ago by

julie dubois • 20

julie dubois • 20 wrote:

I'm not sure the "###" is normal but I think it's not the biggest problem for htseq. When you say "gene_id" in the option ID attribute, you say to htseq to search exactly this word (gene_id) in one of the column of gff3 file. So in your file there is no "gene_id" word. I don't know what is your gff but if you dowlnolad a gtf from ensembl you can see that there is names gene_id, transcript_id ... in one of column. An you can choose to count with htseq according "gene_id" or according "transcript_id" ...

Here for your file, may be you want to count according "ID" but this choice will not only count on gene but equally on match_part, contig ..., because all of this item have an "ID" in the next column. Other option can be to count according : "Parent" or "Target".

It's according your question.

Julie

ADD COMMENT • link written 14 months ago by julie dubois • 20

14 months ago by

thomas.enriquez • 0

thomas.enriquez • 0 wrote:

Hello , Sorry for the time withtout answers. So i set up Htseq count with the term "transcript_id" and I convert my gff3 into gtf using gffread. Htseq has finished his job without any error report, but the output matrix contains only 0 : Geneid TopHat on data 69 data 16 and data 15: accepted_hits DS10_00000001 0 DS10_00000002 0 DS10_00000003 0 DS10_00000004 0 DS10_00000005 0 DS10_00000006 0 DS10_00000007 0 DS10_00000008 0 DS10_00000009 0

Did you face up a similar problem? Thanks a lot for your help, Thomas

ADD COMMENT • link written 14 months ago by thomas.enriquez • 0

Please log in to add an answer.

Similar posts • Search »