Question: Reference Annotation Sources for RNA-seq tools
jramo033 wrote (11 weeks ago):

When running htseq-count, I got this error message:

Fatal error: Unknown error occured [bam_sort_core] merging from 9 files and 1 in-memory blocks... Error occured when processing GFF file (line 2 of file /galaxy-repl/main/files/026/998/dataset_26998011.dat): Strand must be'+', '-', or '.'.

These are the first lines of my GFF file, and I do not see the error:

#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID 
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409,  uc001aaa.3 
uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409,  uc010nxr.1 
uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
Jennifer Hillman Jackson wrote (11 weeks ago):

Hello,

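The file does not appear to be in GFF/GTF format. Its columns (name, chrom, strand, txStart, txEnd, ...) look like a UCSC gene table export in a genePred-style layout. GFF/GTF has nine tab-separated columns with the strand in column 7. In this file column 7 is cdsEnd (for example, 11873), which is most likely why the tool reports that the strand is invalid.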

Use GTF format or a hybrid GFF-GTF format for this input. Avoid GFF3 format: htseq-count only accepts the other two. Header lines (those starting with "#") are also problematic and should be removed once you have found an annotation source.
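As a quick sanity check before running the tool, the column layout can be inspected with a few lines of Python. This is only a minimal sketch, not what htseq-count does internally: the function name `gff_problems` is hypothetical, and it assumes the file is tab-separated, as GFF/GTF requires.

```python
VALID_STRANDS = {"+", "-", "."}


def gff_problems(line):
    """Return a list of reasons why a line does not look like a GFF/GTF record.

    An empty list means the line passed these two basic checks only;
    it is not a full format validation.
    """
    problems = []
    fields = line.rstrip("\n").split("\t")
    # GFF/GTF records have exactly nine tab-separated columns.
    if len(fields) != 9:
        problems.append(f"expected 9 columns, found {len(fields)}")
    # Column 7 (index 6) must hold the strand.
    if len(fields) > 6 and fields[6] not in VALID_STRANDS:
        problems.append(f"column 7 is {fields[6]!r}, not '+', '-' or '.'")
    return problems


# First data line from the question, assuming tab-separated columns.
table_line = (
    "uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t"
    "11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3"
)
for problem in gff_problems(table_line):
    print(problem)
```

Running the same check on the first few non-header lines of any candidate annotation file shows quickly whether it is worth passing to the counting tool.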

Gencode and iGenomes both have human reference annotation that matches UCSC-sourced reference genomes. The UCSC Table Browser will output GTF format for tracks, but the transcript_id and gene_id values will be the same (both will be the "transcript_id"). This is usually a scientific content problem: all counts will be "by transcript", even if they are processed or labeled as being "by gene". Try one of the other sources instead.
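To see whether a given GTF has this problem, the two attributes can be compared record by record. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and the attribute column is assumed to use the usual `key "value";` form.

```python
import re

# Matches attribute pairs written as: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict of quoted attributes."""
    return dict(ATTR_RE.findall(attr_field))


def fraction_gene_equals_transcript(lines):
    """Fraction of records whose gene_id is identical to their transcript_id."""
    same = 0
    total = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip anything that is not a nine-column record
        attrs = parse_attributes(fields[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            if attrs["gene_id"] == attrs["transcript_id"]:
                same += 1
    return same / total if total else 0.0
```

A result at or near 1.0 suggests the file carries transcript identifiers in both fields, so "by gene" counts from it would really be per transcript.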

FAQs: https://galaxyproject.org/support/#getting-inputs-right

Thanks! Jen, Galaxy team

jramo033 wrote (11 weeks ago):

Jen, thank you for your prompt response. How do I import data directly into Galaxy from Gencode and iGenomes? I used "Get Data" to import from UCSC, but I don't see options for Gencode or iGenomes.


Neither of these sources has a dedicated "Get Data" tool.

  • For Gencode, copy the link to the GTF and paste it into the Upload tool. The hg38 data is here: https://www.gencodegenes.org/releases/current.html. After the file is loaded, remove the headers (lines that start with a "#") with the Select tool, using the option "NOT Matching" with the regular expression ^#. Once the formatting is fixed, change the datatype to gtf under Edit Attributes (pencil icon). The data will be given the datatype gff by default, which works fine with some tools but not with others. Avoid the gff3 version of this particular data: it contains duplicated IDs, and several RNA-seq tools do not work with annotation in that format anyway.

  • For iGenomes, download the archive for the target genome/build to your computer, unpack the tar archive, and then upload just the genes.gtf file to Galaxy (browse for the local file, or use FTP). Find all available genomes/builds here: https://support.illumina.com/sequencing/sequencing_software/igenome.html
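
If the file is prepared locally before upload, the header-removal step described for the Select tool can be reproduced outside Galaxy. This is a minimal sketch of the same filter (keep lines NOT matching ^#). The function name and the file names in the usage comment are hypothetical.

```python
def strip_header_lines(lines):
    """Yield only the lines that do not start with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line


# Usage sketch with hypothetical file names:
# with open("annotation.gtf") as src, open("annotation.noheader.gtf", "w") as dst:
#     dst.writelines(strip_header_lines(src))
```

The datatype still needs to be set to gtf in Galaxy after upload, as described above.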

Upload FAQs: https://galaxyproject.org/support/ >> Loading Data

Hope that helps!

Reply written 11 weeks ago by Jennifer Hillman Jackson (modified 5 weeks ago)