I have two reference (genome) files, say EAB_FB_MG.fa (37972
sequences/contigs in total) and EAB_FB.fa (21272 sequences/contigs). I know
there are some common contigs between them. How could I combine/merge
them to get a new reference file with all unique contigs (without
duplicates)?
Many thanks for any idea!!
There is a tool from the FASTX-Toolkit to remove duplicated sequences,
"Collapse sequences", but it is designed to work on short reads.
If the common IDs/sequences are the same between the two files, you
could compare them to identify the common and unique entries. The
general path would be to first convert the FASTA files to tabular with
"Convert Formats -> FASTA-to-Tabular", then compare the IDs using
"Join, Subtract and Group -> Compare two Datasets".
Three comparisons will be needed:
1 - rows unique to file1
2 - rows unique to file2
3 - rows in common
Then merge the results using "Text Manipulation -> Concatenate datasets"
and convert back to fasta using "Convert Formats -> Tabular-to-FASTA".
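
If you would rather do this outside Galaxy, the same idea (compare by ID,
keep each contig once) can be sketched in a few lines of Python. This is
only a rough sketch, not a Galaxy tool: the function names read_fasta and
merge_fasta and the output name merged.fa are made up for illustration. It
assumes the ID is the first word of the header line and that contigs with
the same ID really are the same sequence, so it keeps the first copy it
sees and skips later ones.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(paths, out_path, width=60):
    """Write one record per contig ID, keeping the first occurrence.

    Assumption: the ID is the first whitespace-separated word of the header.
    Returns the number of unique contigs written.
    """
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            for header, seq in read_fasta(path):
                words = header.split()
                contig_id = words[0] if words else ""
                if contig_id in seen:
                    continue  # same ID already written from an earlier file
                seen.add(contig_id)
                out.write(">" + header + "\n")
                # Wrap the sequence at a fixed line width.
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
    return len(seen)
```

For the two files in the question this would be called as
merge_fasta(["EAB_FB_MG.fa", "EAB_FB.fa"], "merged.fa"). Note that it does
not check that the sequences behind a shared ID are actually identical.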
If the IDs are not the same and the sequences are slightly different,
then you will probably need to consider a tool designed to do genome
Hopefully this helps,