merging two unrelated reference genomes and annotation files


2.1 years ago by

US, Tufts University

Hi, I work with an intracellular pathogen. I would like to run an RNA-Seq analysis in Galaxy using a combined host-parasite transcriptome as the reference, together with a combined annotation (GTF) file. I expect about 2% of my reads to be of pathogen origin and the remaining 98% of host origin. How do I create combined host (pig) and pathogen (Cryptosporidium) reference files, with both species merged into one genome file and one annotation file?

Thanks!

Giovanni Widmer Tufts University

rna-seq tophat cufflinks custom-genome • 2.1k views


modified 2.1 years ago • written 2.1 years ago by Widmer, Giovanni • 150

2.1 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Merge the two genome FASTA files with the tool Concatenate, then use the result as a Custom reference genome. Merging the reference annotation will probably work if the files are in GTF format: merge the headers, removing redundant lines, then merge the data lines, then combine everything into one dataset. Tools in the groups Text Manipulation and Filter and Sort can select specific lines from tabular data (GTF is tabular) to create the intermediate header and data-line datasets.

GFF3 format is more complicated, and "ID" attribute conflicts are likely to occur if the files are merged.
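To see whether two GFF3 files would actually collide, the "ID" values can be compared before merging. The following is only a rough sketch outside Galaxy, assuming standard nine-column GFF3 with `key=value` attributes separated by semicolons; the function names `gff3_ids` and `conflicting_ids` are made up for illustration.

```python
def gff3_ids(text):
    """Collect every ID attribute value found in the data lines of a GFF3 text."""
    ids = set()
    for line in text.splitlines():
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.strip().partition("=")
            if key == "ID" and value:
                ids.add(value)
    return ids


def conflicting_ids(first, second):
    """Return the ID values that appear in both GFF3 texts."""
    return gff3_ids(first) & gff3_ids(second)
```

An empty result suggests the "ID" values would not clash; any value returned would need renaming (for example with a species prefix) before the files are combined.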

https://wiki.galaxyproject.org/Support#Custom_reference_genome

Best, Jen, Galaxy team

modified 2.1 years ago • written 2.1 years ago by Jennifer Hillman Jackson ♦ 25k

2.1 years ago by

Widmer, Giovanni • 150

US, Tufts University

Widmer, Giovanni • 150 wrote:

Thanks for your help, Jen. Which Concatenate tool do I use to merge two genome FASTA files? Concatenate Fasta Alignment by Species (under Fasta Manipulation) seems to be the only tool that requires a FASTA-formatted input file. For merging two GTF annotation files, do I use Concatenate datasets tail-to-head (under Text Manipulation)?

Giovanni

written 2.1 years ago by Widmer, Giovanni • 150

Do this the other way around.

For the FASTA datasets, use Concatenate datasets tail-to-head. Make certain there are no extra blank lines between the two after merging. Use the Select tool with the regular expression ^$ to find these.
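For anyone preparing the file outside Galaxy, the same steps might look roughly like this in Python. This is a sketch, not what the Galaxy tools do internally: it assumes both FASTA files fit in memory as strings, and `concatenate_fasta` and `count_blank_lines` are hypothetical helper names.

```python
import re

# Same pattern suggested for the Select tool: a line with nothing on it.
BLANK_LINE = re.compile(r"^$")


def count_blank_lines(text):
    """Count the empty lines in a FASTA text."""
    return sum(1 for line in text.splitlines() if BLANK_LINE.match(line))


def concatenate_fasta(first, second):
    """Join two FASTA texts tail-to-head and drop any empty lines."""
    lines = first.splitlines() + second.splitlines()
    kept = [line for line in lines if not BLANK_LINE.match(line)]
    return "\n".join(kept) + "\n"
```

Running `count_blank_lines` on the merged text is the equivalent of the Select check: it should report zero.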

For the GTF files, extract the headers into new datasets and merge them so that there are no duplicated lines. Tools in Text Manipulation can select lines by position in several ways (the Select tool can too, but based on content, as described above). It is your choice which method to use. Then extract the data lines into new datasets as well. Concatenate datasets tail-to-head can be used to assemble everything into one dataset at the end.
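The GTF procedure can also be sketched in Python for use outside Galaxy. This assumes that header lines are the ones beginning with `#`, which may not cover every GTF variant; `split_gtf` and `merge_gtf` are hypothetical names chosen for this example.

```python
def split_gtf(text):
    """Separate a GTF text into header lines and data lines, skipping blanks."""
    headers, data = [], []
    for line in text.splitlines():
        if not line:
            continue
        # Assumption: header/comment lines start with '#'.
        if line.startswith("#"):
            headers.append(line)
        else:
            data.append(line)
    return headers, data


def merge_gtf(first, second):
    """Merge two GTF texts: de-duplicated headers first, then all data lines."""
    first_headers, first_data = split_gtf(first)
    second_headers, second_data = split_gtf(second)
    # dict.fromkeys removes duplicate header lines while preserving order.
    merged_headers = list(dict.fromkeys(first_headers + second_headers))
    return "\n".join(merged_headers + first_data + second_data) + "\n"
```

The data lines are kept in their original order, host first and pathogen second, mirroring a tail-to-head concatenation.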

modified 2.1 years ago • written 2.1 years ago by Jennifer Hillman Jackson ♦ 25k
