Question: Duplicated sequences within a gene_id in fasta entry?
10 weeks ago, jjrin wrote:


I used the Extract Genomic DNA tool in Galaxy, which produced a FASTA file from my original genome FASTA and my annotation. However, in each entry of the multi-FASTA file, the sequence appears twice.

For example...


Notice that "TATATTTATTAATTTACGGGATC" appears twice within the single entry. This happens for all of the genes in the FASTA file. The number of nucleotides does not match the coordinates: because the sequence is duplicated within the entry, it is twice the size the coordinates imply. For example, coordinates of 700 - 750 should give about 50 nucleotides, but my FASTA entry has 100.

I have run this tool in Galaxy twice and I still have this problem. I am sure that my annotation has the correct coordinates and that there is nothing wrong with the genome, as it works with all of my other tools. Thanks for the help!

modified 10 weeks ago by Jennifer Hillman Jackson • written 10 weeks ago by jjrin

If this is using your own local Galaxy and installed indexes, would you share a few lines of the input file that produce this kind of output? I'll test to see if I can reproduce it. That will narrow down the problem space. Please also include the settings used on the tool form.

If you are already working at Galaxy Main, a share link to the history with the data would be even better. You can post it here publicly if you do not mind everyone seeing your history, or generate the history share link and include it in an email to (private list). Note the dataset numbers for input/output, and please also include a link to this post so we can associate the two.

Thanks! Jen, Galaxy team

written 10 weeks ago by Jennifer Hillman Jackson

Here is the first gene that appears in the Galaxy-produced FASTA file, as given in the GTF annotation:

Scaffold100 StringTie   transcript  65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
Scaffold100 StringTie   exon    65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";

Fasta file is a typical version:


Here is what the Galaxy FASTA output looks like:


The number of bases given is 695 compared to the 340 that there are supposed to be.
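As a quick check on the expected size, the feature length can be computed directly from the GTF line. The sketch below assumes standard GTF coordinates (1-based, with both start and end included), so the length is end - start + 1; the function name is only illustrative.

```python
def gtf_feature_length(gtf_line: str) -> int:
    """Return the feature length implied by a GTF line.

    Assumes 1-based, inclusive start/end in columns 4 and 5.
    Splits on any whitespace, since the tabs were lost in the post above;
    this is safe here because only the first five columns are read.
    """
    fields = gtf_line.split()
    start, end = int(fields[3]), int(fields[4])
    return end - start + 1


exon_line = (
    'Scaffold100 StringTie exon 65415 65755 . + . '
    'transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";'
)
print(gtf_feature_length(exon_line))  # 341
```

Under that assumption the exon above spans 341 bases, so an output of 695 is roughly double the expected length.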

modified 10 weeks ago • written 10 weeks ago by jjrin
10 weeks ago, Jennifer Hillman Jackson (United States) wrote:


The tool is not handling overlapping/duplicated entries well. When the input is given in this specific way (identical coordinates for an exon and a transcript), it seems to output buggy results. The version of the tool at Galaxy Main is not the most current available (the tool was updated a few days ago). I am going to test the latest version [Extract Genomic DNA (version 3.0.3)] in a local Galaxy to see if the problem can be reproduced and, if so, will open a ticket against the tool repository. One of the current test cases (test3) for the tool includes data like yours, so it may be that only a tool update at Galaxy Main is needed, but I cannot confirm that yet. I'll post back with the testing results.

Workaround for now: Isolate the data to be extracted in the GTF for either transcripts or exons before using it as input. The tool Filter and Sort > Filter can be used.
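Outside Galaxy, the same isolation step can be sketched in a few lines of Python. This is only an illustration of the workaround, not what the Filter tool does internally; it assumes a tab-separated GTF with the feature type in column 3, and the function name is hypothetical. In the Filter tool, a condition along the lines of c3=='exon' should have a similar effect.

```python
def filter_gtf_by_feature(lines, feature="exon"):
    """Yield only GTF records whose third column equals `feature`.

    Comment lines (starting with '#') and blank lines are dropped.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == feature:
            yield line


# Two records with identical coordinates, as in the example above.
gtf = [
    'Scaffold100\tStringTie\ttranscript\t65415\t65755\t.\t+\t.\t'
    'transcript_id "MSTRG.5.1"; gene_id "MSTRG.5";\n',
    'Scaffold100\tStringTie\texon\t65415\t65755\t.\t+\t.\t'
    'transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";\n',
]
exons = list(filter_gtf_by_feature(gtf, feature="exon"))
print(len(exons))  # 1
```

Keeping only one feature type means each interval is passed to the extraction tool once, which avoids the identical transcript/exon pair described above.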

Thanks for reporting the problem! Jen, Galaxy team

written 10 weeks ago by Jennifer Hillman Jackson

Related question about gene_id propagation to the output fasta dataset. Will also be tested.

written 10 weeks ago by Jennifer Hillman Jackson

I am having some trouble testing the latest version of this tool effectively. A request to update the public Test and Main servers has been made and follow-up testing to tune the tool for this type of output will occur after that. Meanwhile, use the workaround.

Please follow progress here and feel free to ask for an update about testing once you see it on the public servers.

written 9 weeks ago by Jennifer Hillman Jackson

So I tried linearizing the fasta results from Galaxy and it fixed the unequal bases and coordinates issue! I used the answer from here:

Maybe in the future, make Galaxy output the base sequences on a single line rather than many divided lines? It seems to handle the duplicated/overlapping entries better like this!
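For reference, linearizing a FASTA file can be sketched as below. The linked answer is not reproduced in this thread, so this is a generic approach rather than the exact method used; the function name and the sample records are hypothetical.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with wrapped sequence lines joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical wrapped records, not taken from the Galaxy output.
wrapped = [">rec1", "ACGT", "ACG", ">rec2", "TT"]
for header, seq in linearize_fasta(wrapped):
    print(header, len(seq))
```

With one sequence per line, the length of each record can be compared directly against the length implied by its coordinates.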

written 9 weeks ago by jjrin

Thanks for this suggested correction. I have linked this ticket into the tool update request (post above) and will have it considered as a solution during subsequent testing. Thanks again!

written 9 weeks ago by Jennifer Hillman Jackson