Question: How To Combine Two Reference Genome (Files) In Galaxy?
Binbin You wrote (7.1 years ago):
Hi all, I have two reference (genome) files, say EAB_FB_MG.fa (37972 sequences/contigs in total) and EAB_FB.fa (21272 sequences/contigs). I know they share some contigs. How can I combine/merge them into a new reference file that contains every unique contig (without duplicates)? Many thanks for any ideas!
Jennifer Hillman Jackson (United States) wrote (7.1 years ago):
Hello,

There is a tool from the FASTX-Toolkit to remove duplicated sequences, "Collapse sequences", but it is designed to work on short reads.

If the common IDs/sequences are the same between the two files, you could compare them to identify the common and unique entries. The general path would be to first convert the fasta format to tabular using "Convert Formats -> FASTA-to-Tabular", then compare the IDs using "Join, Subtract and Group -> Compare two Datasets". Three comparisons will be needed:

1 - rows unique to file1
2 - rows unique to file2
3 - rows in common

Then merge the results using "Text Manipulation -> Concatenate datasets" and convert back to fasta using "Convert Formats -> Tabular-to-FASTA".

If the IDs are not the same and the sequences are slightly different, then you will probably need to consider a tool designed to do genome sequence assembly.

Hopefully this helps,
Jen
Galaxy team
-- Jennifer Jackson
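
For anyone who wants to do the same ID-based merge outside Galaxy, here is a minimal Python sketch of the idea. It is not a Galaxy tool and not part of the workflow above; the function names (read_fasta, merge_unique, write_fasta, merge_files) are made up for illustration. It assumes that a shared contig has the same ID (the first word of the header line) in both files, keeps the first copy it sees, and reports IDs whose sequences differ between the files so they can be checked by hand.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*record_sets):
    """Keep the first record seen for each contig ID.

    Assumption: the contig ID is the first whitespace-separated word
    of the header. Returns (records, conflicts), where conflicts lists
    IDs that appear again with a different sequence.
    """
    seen = {}
    conflicts = []
    for records in record_sets:
        for header, seq in records:
            contig_id = (header.split() or [""])[0]
            if contig_id not in seen:
                seen[contig_id] = (header, seq)
            elif seen[contig_id][1].upper() != seq.upper():
                conflicts.append(contig_id)
    return list(seen.values()), conflicts


def write_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapped at `width`."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


def merge_files(path_a, path_b, out_path):
    """Merge two FASTA files into out_path; return the conflicting IDs."""
    with open(path_a) as fa, open(path_b) as fb:
        merged, conflicts = merge_unique(read_fasta(fa), read_fasta(fb))
    with open(out_path, "w") as out:
        out.write(write_fasta(merged))
    return conflicts
```

With the files from the question this could be called as merge_files("EAB_FB_MG.fa", "EAB_FB.fa", "merged.fa"), where "merged.fa" is just a placeholder output name. If the returned list is not empty, the IDs match but the sequences do not, which is the case where an assembly tool is the better route.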