Question: Confusion from "Extract Genomic DNA" (version 2.2.3)
Lu, Mengmeng wrote (4.8 years ago):
Hi Galaxy team,

I recently ran into two problems using "Fetch Sequences > Extract Genomic DNA" on the Galaxy Main instance. I wanted to extract exons, using the coordinates in a GFF3 file, from my reference sequences, which are in FASTA format (loaded from my history, with the genome build specified). After checking the output, I found:

1. The first base of each extracted exon is missing, so each extracted exon sequence is one nucleotide shorter than its real length.
2. Some extracted exons are correct, but others are wrong. The questionable exons cannot be found in the corresponding reference, and I cannot figure out where they come from.

I read the manual and the warnings on the tool page, but I still cannot explain this strange output. Could anyone give me some clues, please?

Thanks.

Best,
Miranda
Jennifer Hillman Jackson (United States) wrote (4.8 years ago):
Hello Miranda,

The problem is most likely that GFF3 is not supported by the tool (GFF and GTF definitely are). If this is in fact true, I will open a development ticket to block the datatype as an accepted input type until (or if) GFF3 support is included. This datatype does not organize its content the same way as the other input types, so supporting it may have a few wrinkles.

That said, the formats are all similar enough in the key fields used by the tool that it might work on GFF3, or rather mostly work, if you are willing to accept some duplications in the output. I haven't gone through all potential scenarios to see what odd or different results might come up. The first format adjustment that comes to mind is to make sure the GFF3 file does not include any FASTA sequence or comment lines at the end.

As a guess about your current results: regarding the "sequence" that you cannot locate, perhaps these are coordinate regions on the negative strand? The resulting FASTA will be reported as the reverse complement of the reference genomic sequence. When interpreting coordinates for these negative-strand regions, you do not need to treat the end as 0-based (instead of the start).

All of these file types have a 1-based start coordinate, not a 0-based one (unlike BED and interval). If you are used to BED/interval format, this may explain why the start seems off by one.

Please review the data in this context and see if it helps to explain the output. Then try using a GFF/GTF version of the coordinates, or even just an interval version, if possible. The tools under 'Text Manipulation' and 'Filter and Sort' should be able to help transform the file. We'll post an update if there is more to share.

Hopefully this helps!

Jen
Galaxy team
--
Jennifer Hillman-Jackson
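To make the two points above concrete (the 1-based start and the reverse complement on the negative strand), here is a minimal Python sketch. It is not the Galaxy tool's code and does not claim to reproduce its behavior; the function names (`parse_gff_features`, `extract`, `reverse_complement`), the toy genome, and the example records are all invented for illustration. It assumes tab-separated GFF-style lines with 1-based inclusive coordinates, and it stops reading at a `##FASTA` directive so that embedded sequence at the end of a GFF3 file is ignored.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (case preserved)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def parse_gff_features(lines, feature_type="exon"):
    """Yield (seqid, start0, end, strand) for each matching feature.

    GFF-style coordinates are 1-based and inclusive, so the 0-based,
    half-open equivalent (as used by BED/interval) is (start - 1, end).
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows this directive; stop here
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        seqid, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        yield seqid, start - 1, end, strand


def extract(genome, seqid, start0, end, strand):
    """Slice the reference; reverse-complement minus-strand features."""
    seq = genome[seqid][start0:end]
    return reverse_complement(seq) if strand == "-" else seq


# Hypothetical toy data, for illustration only.
genome = {"chr1": "AACCGGTTAC"}
gff = [
    "##gff-version 3",
    "chr1\tdemo\texon\t3\t6\t.\t+\t.\tID=exon1",
    "chr1\tdemo\texon\t7\t10\t.\t-\t.\tID=exon2",
    "##FASTA",
    ">chr1",
    "AACCGGTTAC",
]

exons = [extract(genome, *feature) for feature in parse_gff_features(gff)]
print(exons)

# If the 1-based start (3) were used directly as a 0-based index,
# the first base would be lost:
print(genome["chr1"][3:6])
```

In this toy case the plus-strand exon at 3..6 comes out as `CCGG`, while the minus-strand exon at 7..10 comes out as `GTAA`, which does not appear anywhere in the forward reference string. Using the start without subtracting 1 gives `CGG`, one base short at the front, which matches the kind of symptom described in the question.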