Attaching annotations to Sequences

3.4 years ago by

United States

kstewar333 • 10 wrote:

I have two files: one containing my original transcriptome sequences and another containing blast+ blastn annotations for the transcriptome. I want to add the subject description and subject ID to the title of each original sequence, so that each title contains the query ID, the subject ID from the NCBI database, and the description.

For example, my transcriptome sequences look like this:

>Bta00064
ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

When I ran blastn on the sequences, blast+ created a tab-delimited file which looks like this (note: the fields are separated by tabs):

Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1| JN165250 96.87 0.0 1340 2242
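
To make the column layout concrete, here is a minimal Python sketch that splits one such line on tabs and picks out the three fields needed for the new title. It assumes the first three columns are query ID, subject description, and subject ID, as in the example line; the `BlastHit` and `parse_hit` names are only illustrative and not part of blast+ or Galaxy.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """The three columns needed for the new FASTA title (illustrative name)."""
    query_id: str
    description: str
    subject_id: str


def parse_hit(line: str) -> BlastHit:
    # Split on tabs only, so spaces inside the description are preserved.
    fields = line.rstrip("\n").split("\t")
    # Assumed column order: query ID, subject description, subject ID, ...
    return BlastHit(fields[0], fields[1], fields[2])


example = (
    "Bta00064\t"
    "Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds\t"
    "gi|339896252|gb|JN165250.1|\tJN165250\t96.87\t0.0\t1340\t2242"
)
hit = parse_hit(example)
print(hit.query_id)
print(hit.description)
print(hit.subject_id)
```

Splitting on tabs rather than on any whitespace matters here, because the description itself contains spaces.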

I want to make a new file where the transcriptomes have fasta titles that contain the query ID, subject ID, and subject description like this:

>Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1

ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

I'm having trouble doing this. My first file is a FASTA file with my sequences; my second file is the blast+ output, which I formatted as (query ID, subject description, subject ID, percent match, e-value, length, bit value). I want to make a FASTA file with the original sequences but with the titles changed to >query ID, subject description, subject ID.

Galaxy recognizes these columns, but I can't seem to combine the two. Basically, how do I combine two files so that they align based on a single factor and then turn them into FASTAs using specific columns?

Thank you for the help!

rna-seq alignment galaxy • 892 views


modified 3.4 years ago by Guy Reeves • 1.0k • written 3.4 years ago by kstewar333 • 10

3.4 years ago by

Germany

Guy Reeves • 1.0k wrote:

First check whether there is a shared workflow that already does this.

If you cannot find one, this might work. I have not tried it myself, but what about using usegalaxy.org as follows:

A Use 'convert formats>FASTA-to-Tabular converter' on

>Bta00064

ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

B Next, use 'Text Manipulation>Convert delimiters to TAB', converting white space to tabs, on

Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1| JN165250 96.87 0.0 1340 2242

Then run 'Merge Columns together' on the output, merging all the columns except the first, which leaves 'Bta00064 ' as the first column.

Add a column with '>' in it and merge it with the first column to give '>Bta00064 '

C Once you have your two tabular files ready, where the first column of each is '>Bta00064 ', use the tool 'Join, Subtract and Group>Join two Datasets side by side on a specified field', joining on the first column of each file.

D Finally, take the joined dataset and use 'Tabular-to-FASTA converts tabular file to FASTA format' to put it back into FASTA format with the long identifier you want.
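
Outside Galaxy, the same join can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not a tested replacement for the steps above: it assumes the table is tab-delimited with query ID, description, and subject ID in the first three columns, keeps only the first hit per query, and strips a trailing '|' from the subject ID to match the desired title. The function names, the second record 'Bta99999', and the truncated example sequence are made up for illustration.

```python
def load_annotations(table_lines):
    """Map query ID -> 'description subject_id' from tab-delimited BLAST rows."""
    annotations = {}
    for line in table_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        query_id, description, subject_id = fields[0], fields[1], fields[2]
        # Assumption: keep only the first hit seen for each query.
        annotations.setdefault(
            query_id, f"{description} {subject_id.rstrip('|')}"
        )
    return annotations


def annotate_fasta(fasta_lines, annotations):
    """Yield FASTA lines, extending each title that has a matching annotation."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            query_id = parts[0] if parts else ""
            extra = annotations.get(query_id)
            # Titles without a BLAST hit are passed through unchanged.
            yield f">{query_id} {extra}" if extra else line
        else:
            yield line


# In-memory stand-ins for the two files (the sequence is truncated).
table = [
    "Bta00064\tBemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, "
    "complete cds\tgi|339896252|gb|JN165250.1|\tJN165250\t96.87\t0.0\t1340\t2242\n"
]
fasta = [
    ">Bta00064\n",
    "ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG\n",
    ">Bta99999\n",  # hypothetical record with no BLAST hit
    "ACGT\n",
]

result = list(annotate_fasta(fasta, load_annotations(table)))
print("\n".join(result))
```

With real files, the two lists would be replaced by open file handles, and the yielded lines written to a new FASTA file.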

If you think this is worth trying, I would suggest testing it with a FASTA file containing a small number of sequences to get it to work, and then making a workflow. Do say if it works.

Thanks, Guy
