Question: Attaching annotations to Sequences
kstewar333 (United States) wrote 3.4 years ago:

I have two files: one containing my original transcriptome sequences and another containing BLAST+ blastn annotations for the transcriptome. I want to add the subject ID and description to the title of each original sequence, so that each title contains the query ID, the subject ID from the NCBI database, and the description.

For example, my transcriptome sequences look like this:

>Bta00064
ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

When I ran blastn on the sequences, BLAST+ created a tab-delimited file that looks like this (note: the fields are separated by tabs):

Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1| JN165250 96.87 0.0 1340 2242

I want to make a new file where the sequences have FASTA titles that contain the query ID, subject description, and subject ID, like this:

>Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1

ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

I'm having trouble doing this. My first file is a FASTA file with my sequences; my second file is the BLAST+ output, which I formatted as (query ID, subject description, subject ID, percent match, e-value, length, bit score). I want to make a FASTA file with the original sequences but with the titles changed to >query ID, subject description, subject ID.

Galaxy recognizes these columns, but I can't seem to combine the two files. Basically, how do I join two files on a shared field and then turn the result into FASTA using specific columns?

Thank you for the help!

 

rna-seq alignment galaxy • 892 views
Guy Reeves (Germany) wrote 3.4 years ago:

Hi,

First, check whether there is a shared workflow that already does this.

If you cannot find one, the following might work. I have not tried it myself, but what about using usegalaxy.org as follows:

A. Use 'Convert Formats > FASTA-to-Tabular converter' on

>Bta00064

ATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG....

 

B. Next, use 'Text Manipulation > Convert delimiters to TAB' to convert whitespace to tabs on

Bta00064 Bemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) mRNA, complete cds gi|339896252|gb|JN165250.1| JN165250 96.87 0.0 1340 2242

On the output, use 'Merge Columns together' to merge all the columns except the first, leaving 'Bta00064' as the first column.

Add a column with '>' in it and merge it with the first column to give '>Bta00064'.

 

C. Once you have your two tabular files ready, where the first column of each is '>Bta00064', use the tool 'Join, Subtract and Group > Join two Datasets side by side on a specified field', joining on the first column of each file.

 

D. Finally, take the joined dataset and use 'Tabular-to-FASTA converts tabular file to FASTA format' to put it back into FASTA format with the long identifier you want.

If you think this is worth trying, I would suggest testing these steps on a FASTA file with a small number of sequences to get it to work, then making a workflow. Do say if it works.
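
If a script is an option, the same join (steps A to D above) can also be sketched outside Galaxy in a few lines of Python. This is only a minimal sketch and has not been run on your data. It assumes the BLAST table is tab-separated with the query ID, subject description, and subject ID in the first three columns, as in the example row, and it keeps only the first hit per query. The function names load_annotations and annotate_fasta are made up for illustration.

```python
import io


def load_annotations(blast_handle):
    """Map each query ID to (subject description, subject ID).

    Assumption: tab-separated columns in the order query ID,
    subject description, subject ID, followed by any other fields.
    Only the first hit seen for each query is kept.
    """
    annotations = {}
    for line in blast_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query_id, description, subject_id = fields[0], fields[1], fields[2]
        annotations.setdefault(query_id, (description, subject_id))
    return annotations


def annotate_fasta(fasta_handle, annotations, out_handle):
    """Copy a FASTA file, extending the title of every annotated sequence."""
    for line in fasta_handle:
        if line.startswith(">"):
            words = line[1:].split()
            query_id = words[0] if words else ""
            if query_id in annotations:
                description, subject_id = annotations[query_id]
                line = f">{query_id} {description} {subject_id}\n"
        # Sequence lines and titles without a hit are written unchanged.
        out_handle.write(line)


if __name__ == "__main__":
    # In-memory stand-ins for the two files; the sequence is the
    # truncated one shown in the question.
    fasta = io.StringIO(">Bta00064\nATGGCGTTCCAACTACTACTTCTCAGCGTCGGTGTCGCTG\n")
    blast = io.StringIO(
        "Bta00064\tBemisia tabaci strain NJ-Imi cytochrome P450 (CYP6DV5) "
        "mRNA, complete cds\tgi|339896252|gb|JN165250.1|\tJN165250\t96.87\t"
        "0.0\t1340\t2242\n"
    )
    result = io.StringIO()
    annotate_fasta(fasta, load_annotations(blast), result)
    print(result.getvalue(), end="")
```

For real data, the io.StringIO objects would be replaced with files opened via open(), for example open("transcriptome.fasta") and open("blast_hits.tsv") (both file names are hypothetical). Sequences with no BLAST hit keep their original title.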

Thanks, Guy