Duplicated sequences within a gene

Question: Duplicated sequences within a gene_id in fasta entry?

16 months ago by

jjrin • 10

Hello,

I used the Extract Genomic DNA tool in Galaxy, and it output a FASTA file from my original genome FASTA and my annotation. However, in each entry of the multi-FASTA file, the sequence appears twice.

For example...

>gene.1 
TATATTTATTAATTTACGGGACTATATTTATTAATTTACGGGATC

Notice that "TATATTTATTAATTTACGGGATC" appears twice in the single entry. This happens for all of the genes in the FASTA file. When I compare against the coordinates, the number of nucleotides does not add up: because the sequence is duplicated, each entry is twice the size the coordinates imply. For example, coordinates of 700 - 750 should have 50 nucleotides, but in my FASTA file the entry has 100.

I have run this tool in Galaxy twice and I still have this problem. I am sure that my annotation has the correct coordinates and that there is nothing wrong with the genome, as it works with all of the other tools I use. Thanks for the help!

fasta extract genomic dna galaxy rna-seq • 627 views

modified 16 months ago by Jennifer Hillman Jackson ♦ 25k • written 16 months ago by jjrin • 10

If this is from your own local Galaxy with locally installed indexes, would you share a few lines of the input file that produce this kind of output? I'll test at http://usegalaxy.org to see if I can reproduce the output. That will narrow down the problem space. Please also include the settings used on the tool form.

If you are already working at Galaxy Main (http://usegalaxy.org), a share link to the history with the data would be even better. You can post it here publicly if you do not mind everyone seeing your history, or generate the history share link and include it in an email to galaxy-bugs@lists.galaxyproject.org (private list). Note the dataset numbers for input/output, and please also include a link to this post so we can associate the two.

Thanks! Jen, Galaxy team

written 16 months ago by Jennifer Hillman Jackson ♦ 25k

Here is the first gene that appears in the Galaxy-produced FASTA file, as it appears in the GTF annotation, followed by the start of the genome FASTA file.

Scaffold100 StringTie   transcript  65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
Scaffold100 StringTie   exon    65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";

The genome FASTA file is in the typical format:

>Scaffold100
gttcaaactttcataacctgccaaattttgtaaaatgaacatggtgtggccacaaaaatggtcgtggtcaaaaaattcac
tgcgcgcaagtttttttgtccctctttttatttccaaaatgttgggaggtATttagacatttcacatttatattctcata
taccacactttccacattacacaATCTTCAGCCACCTTATAAATGGAACCTGCCGGTGCTACTGCTTTACATTGCTCTCT
CTTTATATAAATGAATGAGGCAATGTGATAGTGGAAATGTGTCCGATATAATGTAATTATAAGGTTCGACACTTAGAATA
AGAGAAGGAAAACTCTTCTGCATGTGATGAGAATCCTGCAGTGTGAGATGTGGCCTCAAGGGGGAGACATTTCCTTCACC
AAATTATCCCAGAGAAAGATCATTTTTATTCTGAGTTGGAAGCTGAGGAACGCGTTTCGTGTTTAATTCCATTTAAAAAA
AAAAACTTTTAAAAGGCGATAAAACGAATCTCTTGCAAATTTCCTATGTCCAACTGCCCTGTGATTCTATTGGGGGGGTG
TCCTCCCATGGGAACTGCCTGGCGACAGATTGAGGACGAGGTTTTTCATGGGGAATTTAAAACAATTTATCAACAGAAAC

Here is what the Galaxy FASTA output looks like:

>?_Scaffold100_65415_65755_+
TATATTTATTAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCT
GACACATTGTTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCA
TGGGAAAGCTGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCAC
ATCACACACACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGA
CCTGAAGGGTTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGT
GAGCAAAGCAATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAG
AGTGACCATCTATAGGAGTGACCAACTATAAGAGTGATCATTATATTTAT
TAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCTGACACATTG
TTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCATGGGAAAGC
TGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCACATCACACAC
ACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGACCTGAAGGG
TTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGTGAGCAAAGC
AATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAGAGTGACCAT
CTATAGGAGTGACCAACTATAAGAGTGATCAT

The number of bases given is 695 compared to the 340 that there are supposed to be.
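A count like this can be checked with a short script. The sketch below is only an illustration, and the function names are hypothetical. It relies on two facts: GTF coordinates are 1-based and inclusive, so 65415 to 65755 spans 341 bases, and the line breaks in a wrapped FASTA record are not bases and should not be counted. It also tests whether a sequence is exactly two back-to-back copies of the same string, which is the pattern described in this thread.

```python
def gtf_span(start: int, end: int) -> int:
    """Length of a GTF feature; GTF coordinates are 1-based and inclusive."""
    return end - start + 1


def count_bases(fasta_record: str) -> int:
    """Count sequence characters in one FASTA record.

    The header line (starting with '>') and all line breaks are ignored.
    """
    lines = fasta_record.strip().splitlines()
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def is_doubled(sequence: str) -> bool:
    """Return True if the sequence is two identical copies placed end to end."""
    half, remainder = divmod(len(sequence), 2)
    return remainder == 0 and half > 0 and sequence[:half] == sequence[half:]


# Toy record (made up for illustration, not taken from the real output).
record = ">toy_1_6_+\nACGTAC\nACGTAC\n"
sequence = "".join(record.strip().splitlines()[1:])

print(gtf_span(1, 6))        # expected number of bases for coordinates 1-6
print(count_bases(record))   # bases actually present in the record
print(is_doubled(sequence))  # whether the record is an exact doubling
```

If the base count is exactly twice the expected span and `is_doubled` returns True, the record matches the duplication described here rather than a coordinate mismatch.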

modified 16 months ago • written 16 months ago by jjrin • 10

16 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Hello,

The tool is not handling overlapping/duplicated entries well. When the input is given in this specific way (identical coordinates for an exon and a transcript), it seems to output buggy results. The version of the tool at Galaxy Main (http://usegalaxy.org) is not the most current available (the tool was updated just a few days ago). I am going to test the latest version [Extract Genomic DNA (version 3.0.3)] in a local Galaxy to see if the problem can be reproduced and, if so, will open a ticket against the tool repository. One of the current test cases (test3) for the tool includes data like yours, so it may be that only a tool update at Galaxy Main is needed, but I cannot confirm that yet. I'll post back with the testing results.

Workaround for now: filter the GTF so that it contains only transcript lines or only exon lines before using it as input. The tool Filter and Sort > Filter can be used.
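For anyone preparing the GTF outside Galaxy, the same workaround can be sketched in a few lines of Python. This is a minimal sketch, not part of the tool: the function name is hypothetical, and it assumes a standard tab-delimited GTF in which the third column holds the feature type (in the Filter tool, the equivalent condition would be along the lines of `c3=='exon'`).

```python
def filter_gtf_by_feature(lines, feature="exon"):
    """Yield only the GTF lines whose third column equals `feature`.

    Comment lines (starting with '#') and blank lines are skipped.
    Assumes tab-delimited input, as the GTF format requires.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == feature:
            yield line


# Two lines with identical coordinates, modelled on the example in this thread.
gtf_lines = [
    'Scaffold100\tStringTie\ttranscript\t65415\t65755\t.\t+\t.\tgene_id "MSTRG.5";\n',
    'Scaffold100\tStringTie\texon\t65415\t65755\t.\t+\t.\tgene_id "MSTRG.5";\n',
]

exons_only = list(filter_gtf_by_feature(gtf_lines, feature="exon"))
print(len(exons_only))
```

Passing `feature="transcript"` keeps the transcript lines instead, so each interval reaches the extraction step only once.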

Thanks for reporting the problem! Jen, Galaxy team

written 16 months ago by Jennifer Hillman Jackson ♦ 25k

Related question about gene_id propagation to the output fasta dataset. Will also be tested. https://biostar.usegalaxy.org/p/23823/

written 16 months ago by Jennifer Hillman Jackson ♦ 25k

I am having some trouble testing the latest version of this tool effectively. A request to update the public Test and Main servers has been made and follow-up testing to tune the tool for this type of output will occur after that. Meanwhile, use the workaround.

Please follow progress here and feel free to ask for an update about testing once you see it on the public servers. https://github.com/galaxyproject/galaxy/issues/4320

written 16 months ago by Jennifer Hillman Jackson ♦ 25k

So I tried linearizing the fasta results from Galaxy and it fixed the unequal bases and coordinates issue! I used the answer from here: https://www.biostars.org/p/261820/#262258

Maybe in the future, make Galaxy output the base sequences on a single line rather than many divided lines? It seems to handle the duplicated/overlapping entries better like this!
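For reference, linearizing a wrapped FASTA file means joining each record's sequence lines into a single line under its header. The sketch below shows one way to do that in Python. It is an illustration only and not necessarily the method used in the linked Biostars answer; the function name is hypothetical.

```python
def linearize_fasta(lines):
    """Yield each header line, then that record's sequence joined onto one line."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record begins: emit the previous record first, if any.
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header
        yield "".join(chunks)


# Toy wrapped input (made up for illustration).
wrapped = [">a\n", "AC\n", "GT\n", ">b\n", "TT\n"]
for out_line in linearize_fasta(wrapped):
    print(out_line)
```

With one sequence line per record, the length of that line is the base count, with no line breaks to inflate it.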

written 16 months ago by jjrin • 10

Thanks for this suggested correction. I have linked this ticket into the tool update request (post above) and will have it considered as a solution during subsequent testing. Thanks again!

written 16 months ago by Jennifer Hillman Jackson ♦ 25k
