Question: Extracting Sequences For Transcripts From Reference Genome
5.6 years ago by
Lizex Husselmann80 wrote:
Dear Galaxy community,

I'm new to Galaxy and would like to ask the following. I have trimmed and QC'ed my paired-end and single-end data from an Illumina HiScan SQ, mapped it using TopHat, and run Cufflinks, Cuffmerge and Cuffdiff. I would like to analyze the gene_exp.diff file by extracting the significant transcripts. I used grep "yes" to extract only the significant transcripts. From this I have the locus start and end coordinates of each transcript, for example "XLOC_000544 XLOC_000544 - chr1:12763969-12765675 C0 C4 OK 3.16487 1628.25 9.00696 -4.57022 4.8722e-06 0.00905256 yes". How can I extract this information or sequence from the reference genome?

Kind regards
Lizex
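For reference, the grep "yes" step can also be done by testing the last column only, which avoids matching "yes" elsewhere on a line. This is a minimal Python sketch, not part of the original post. It assumes the file is tab-separated, that the locus is the 4th column and the significance flag is the last column (as in the example line above), and the function name `significant_loci` is made up for illustration.

```python
import csv
import io


def significant_loci(diff_lines):
    """Yield (test_id, chrom, start, end) for rows whose last column is 'yes'."""
    reader = csv.reader(diff_lines, delimiter="\t")
    for row in reader:
        # Skips blank lines, the header row, and non-significant rows.
        if not row or row[-1] != "yes":
            continue
        # Locus looks like "chr1:12763969-12765675" (assumed format).
        chrom, span = row[3].rsplit(":", 1)
        start, end = span.split("-")
        yield row[0], chrom, int(start), int(end)


# The example line from the question, with tabs between fields.
example = (
    "XLOC_000544\tXLOC_000544\t-\tchr1:12763969-12765675\tC0\tC4\tOK\t"
    "3.16487\t1628.25\t9.00696\t-4.57022\t4.8722e-06\t0.00905256\tyes\n"
)

for record in significant_loci(io.StringIO(example)):
    print(record)
```

With a real file, passing `open("gene_exp.diff")` in place of the `io.StringIO` object should work the same way.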
rna-seq cuffmerge cufflinks • 1.3k views
5.6 years ago by
United States
Jennifer Hillman Jackson25k wrote:
Hi Lizex,

It sounds like you are working on the command line and now want to import data into Galaxy to work with it? If so, I'll add an extra comment to be careful about the reference genome when moving into Galaxy:

To get the data into Galaxy - use FTP:

The gene expression file's XLOC IDs are the same as those in the attribute field (9th field) of the GTF file used as input to Cuffdiff. To get the transcript sequence, you basically want to match up those identifiers, then extract the sequence from the reference genome. (Note that this will not include any base-level variation from your sequence data. This method creates transcripts from the genomic sequence, based on coordinates. This tool package does not assemble new consensus sequences.)

The general path is:

1 - upload the "gene differential expression testing" file, the GTF file, and the reference genome if needed
2 - cut out the "XLOC" field from the "gene differential expression testing" file using the tool "Text Manipulation -> Cut"
3 - use the tool "Filter and Sort -> Filter GTF data by attribute values_list" to obtain only the records related to your XLOC list
4 - obtain the fasta sequence with the tool "Fetch Sequences -> Extract Genomic DNA", using the result from 3 as the query and your uploaded reference genome as a "Custom reference genome" if needed

More about custom reference genomes & RNA-seq tools is in these links:

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Hillman-Jackson
Galaxy Support and Training
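For anyone who wants to see the logic of steps 3 and 4 outside Galaxy, here is a rough Python sketch. It is not how the Galaxy tools are implemented. It assumes the GTF's 9th field carries `gene_id "XLOC_..."`, that only `exon` records are wanted, and that the reference fits in memory. The function names and the toy genome and GTF lines are invented for illustration. It returns one sequence per exon and does not join exons into full transcripts.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def read_fasta(lines):
    """Return {name: sequence}; the name is the first word of each header."""
    seqs, name, parts = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(parts)
            name, parts = line[1:].split()[0], []
        elif line:
            parts.append(line)
    if name is not None:
        seqs[name] = "".join(parts)
    return seqs


def exon_sequences(gtf_lines, genome, wanted_ids):
    """Yield (gene_id, chrom, start, end, strand, seq) for exons of wanted genes."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if not match or match.group(1) not in wanted_ids:
            continue
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive.
        seq = genome[fields[0]][start - 1:end]
        if fields[6] == "-":
            # Reverse-complement features on the minus strand.
            seq = seq.translate(COMPLEMENT)[::-1]
        yield match.group(1), fields[0], start, end, fields[6], seq


# Toy data (hypothetical, for illustration only).
genome = read_fasta([">chr1", "ACGTACGTAC", "GGGTTTAAAC"])
gtf = [
    'chr1\tCufflinks\texon\t3\t8\t.\t+\t.\tgene_id "XLOC_000544"; transcript_id "TCONS_00000001";',
    'chr1\tCufflinks\texon\t11\t14\t.\t-\t.\tgene_id "XLOC_000999"; transcript_id "TCONS_00000002";',
]

for hit in exon_sequences(gtf, genome, {"XLOC_000544"}):
    print(hit)
```

The set passed as `wanted_ids` corresponds to the XLOC list cut out in step 2. As with the Galaxy route, the sequences come straight from the reference and carry no variation from the reads.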

