Confusion From "Extract Genomic DNA (Version 2.2.3)"


Hi Galaxy team,

I recently ran into two problems when using "Fetch sequences > Extract Genomic DNA" on the Galaxy Main instance. I wanted to extract exons from my reference sequences (FASTA format, loaded from my history, with the genome build specified) according to the coordinates in a GFF3 file. After checking the output, I found:

1. The first base of each extracted exon is missing, so each extracted exon sequence is one nucleotide shorter than its real length.
2. Some extracted exons are correct, but others are wrong. The questionable exons cannot be found in the corresponding reference, and I cannot figure out where they come from.

I read the manual and the warnings on the tool page, but I still cannot explain this strange output. Could anyone give me some clues, please?

Thanks.

Best,
Miranda


modified 4.8 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.8 years ago by Lu, Mengmeng • 50

Jennifer Hillman Jackson ♦ 25k wrote:

Hello Miranda,

The problem is most likely that GFF3 is not supported by the tool (GFF and GTF definitely are). If this is in fact true, I will open a development ticket to block the datatype from being accepted as an input type until (or if) GFF3 support is included. This datatype does not organize its content the same way as the other input types, so supporting it may have a few wrinkles.

That said, the formats are all similar enough in the key fields used by the tool that it might work on GFF3, or rather mostly work, if you are willing to accept some duplications in the output. I haven't gone through all potential scenarios to see what odd or different results might come up. The first format issue that comes to mind is to make sure the GFF3 file does not include any FASTA sequence or comment lines at the end.

But, just as a guess for your results right now: regarding the "sequence" that you cannot locate, perhaps these are coordinate regions associated with the negative strand? The resulting FASTA will be reported as the reverse complement of the reference genomic sequence. When interpreting coordinates for these negatively stranded regions, you won't need to account for the end being 0-based (instead of the start).

All of these file types have a 1-based start coordinate, not a 0-based start (unlike BED and interval). If you are used to BED/interval format, this may explain why the start seems off by one.

https://wiki.galaxyproject.org/Learn/Datatypes#GFF

Please review the data in this context and see if this helps to explain it. Then try using a GFF/GTF version, or even just an interval version, of the coordinates if possible. Tools in 'Text Manipulation' plus 'Filter and Sort' should be able to help transform the file. We'll post an update if there is more to share.

Hopefully this helps!

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
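To make the coordinate points above concrete, here is a minimal Python sketch for checking an extracted exon by hand. It is not the Galaxy tool's code: the function names (`reverse_complement`, `extract_gff_feature`, `gff_to_interval`) and the toy reference string are made up for illustration, and it assumes GFF/GTF coordinates are 1-based with an inclusive end, while BED/interval coordinates are 0-based with an exclusive end.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def extract_gff_feature(reference, start, end, strand):
    """Extract a feature given GFF-style coordinates.

    start and end are 1-based and inclusive, so the Python slice
    begins at start - 1 and stops at end.
    """
    subseq = reference[start - 1:end]
    if strand == "-":
        # Minus-strand features are reported as the reverse complement,
        # so they will not match the forward reference text directly.
        subseq = reverse_complement(subseq)
    return subseq


def gff_to_interval(start, end):
    """Convert 1-based inclusive (GFF) to 0-based half-open (BED/interval)."""
    return start - 1, end


# Hypothetical 10-base reference, for illustration only.
reference = "AACCGGTTAC"

plus_exon = extract_gff_feature(reference, 2, 5, "+")
minus_exon = extract_gff_feature(reference, 7, 10, "-")

# Reading the 1-based start as if it were 0-based drops the first base.
misread = reference[2:5]

print(plus_exon)              # ACCG
print(minus_exon)             # GTAA
print(misread)                # CCG
print(gff_to_interval(2, 5))  # (1, 5)
```

If an exon from the tool matches `reverse_complement` of the expected region, it is probably a minus-strand feature. If it matches the expected region minus its first base, the start coordinate was likely read with the wrong convention somewhere along the way.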

written 4.8 years ago by Jennifer Hillman Jackson ♦ 25k
