Question: Extracting sequences from FASTA file
3.5 years ago by
United States
d.gerrard80 wrote:


I have been using on a Mac to extract sequences from a FASTA file. I have a file called 'Trinity.fasta' containing FASTA sequences with identifiers of the form 'comp#_c#_seq#', for instance 'comp1_c0_seq1'. I also have a text file listing the specific contig identifiers whose sequences I would like to get, but those identifiers are written as 'comp55698_c0'. As you can see, the '_seq#' part is missing.


Is there another program I could use that would let me match the identifiers even though the '_seq#' part is missing?




3.5 years ago by
United States
Jennifer Hillman Jackson wrote:


This is not really a Galaxy question, but I can't help sharing a simple and very useful command-line option that will work here.

  1. Make a backup copy of the file you intend to modify (the one with the "compNNNNN_c0_seq1" content). Put the backup somewhere else entirely, such as a directory labeled "YYYYMMDD_originals_experiment-name" or something equally obvious. (Sub-directories tend not to be the best place for backups IMHO: it is too easy to "rm -rf" and lose it all.)
  2. With the working copy of the file (let's call it "contigs.fasta"), execute the following at the prompt ($ == prompt):

$ sed 's/_seq1//' contigs.fasta > contigs_clean.fasta
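Note that the command above only removes the literal text "_seq1". Since the identifiers in the question look like 'comp#_c#_seq#', a slightly more general pattern may be needed. The following is only a sketch: the sample records are made up for illustration, and it assumes the "_seq" part is always followed by digits and that only header lines (those starting with ">") should be touched.

```shell
# Build a tiny example FASTA (hypothetical data, for illustration only).
printf '>comp1_c0_seq1 len=8\nACGTACGT\n>comp1_c0_seq2 len=4\nGGCC\n' > contigs.fasta

# On header lines only, delete the first "_seq" followed by one or more digits.
# Sequence lines are passed through unchanged.
sed '/^>/ s/_seq[0-9][0-9]*//' contigs.fasta > contigs_clean.fasta

cat contigs_clean.fasta
```

Be aware that after this edit, sequences that differed only in their "_seq#" number share the same identifier. That is fine for pulling out every sequence of a contig, but some tools expect unique identifiers.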

There are at least 30 ways to do this sort of manipulation; sed is just my favorite command-line tool. Short and sweet.
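As one of those other ways, here is a sketch that leaves both files unmodified and does the extraction directly with awk. It is not part of the original recipe: the sample data and the output name "selected.fasta" are hypothetical, and it assumes the identifier list has one identifier per line and that each FASTA identifier ends in "_seq" plus digits.

```shell
# Tiny example inputs (hypothetical data, for illustration only).
printf '>comp1_c0_seq1\nACGT\n>comp2_c0_seq1\nTTTT\nGGGG\n>comp1_c0_seq2\nCCCC\n' > contigs.fasta
printf 'comp1_c0\n' > identifiers.txt

awk '
  # First file (identifiers.txt): remember every wanted identifier.
  NR == FNR { wanted[$1] = 1; next }

  # Second file (contigs.fasta): on each header, decide whether to keep the record.
  /^>/ {
    id = substr($1, 2)            # first word of the header, without the leading ">"
    sub(/_seq[0-9]+$/, "", id)    # strip the trailing _seq# before comparing
    keep = (id in wanted)
  }

  # Print the header and all following sequence lines while keep is true.
  keep
' identifiers.txt contigs.fasta > selected.fasta

cat selected.fasta
```

With the sample data this keeps both comp1_c0 records (seq1 and seq2) with their original headers and drops comp2_c0.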

Alternatively, the identifiers could be modified to add on the "_seq" bit. I personally would use "vi", or whatever your favorite text editor is, for that.

  1. Back up first! Then call the working file something like "identifiers.txt".
  2. Assumption: the file is a single-column list of identifiers, and nothing else!

$ vi identifiers.txt

Within vi, in command mode (hit the "esc" key if needed), type:

:%s/$/_seq1/ (hit return)
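If you would rather not open an editor, the same substitution can be run non-interactively with sed. This is a sketch under the same assumption as above (one identifier per line, nothing else); the sample identifiers and the output name "identifiers_seq1.txt" are made up for illustration.

```shell
# Tiny example identifier list (hypothetical data, for illustration only).
printf 'comp55698_c0\ncomp1_c0\n' > identifiers.txt

# Append "_seq1" to the end of every line, writing the result to a new file.
sed 's/$/_seq1/' identifiers.txt > identifiers_seq1.txt

cat identifiers_seq1.txt
```

Like the vi command, this only ever adds "_seq1", so contigs that also have a "_seq2" or higher would still be missed. Blank lines in the list would also pick up the suffix, so check the file for those first.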

Either edit can also be done within Galaxy itself using its text manipulation tools. More about these commands and their options can be found by searching online.

Best, Jen, Galaxy team

