Linking exons to genes from UCSC human

Question: Linking exons to genes from UCSC human

10 months ago by

jimhu • 20

jimhu • 20 wrote:

GOAL: identify exons by what gene they come from

Posting this after being cajoled by @galaxyproject to ask support questions here and not just on twitter...

I'm teaching a short module on Big Data for first-year students where I include a couple of sessions on Galaxy. I started with the Galaxy 101 tutorial of finding the exons on a human chromosome with the most SNPs. I wanted to extend the lesson in a couple of ways and build a homework assignment based on asking which genes on a particular chromosome had most SNPs that fall into exons. But then I realized I don't know how to do that myself. Here's what I came up with

The regex option

The exons table resulst looks like this

chr21   5011798 5011874 uc061ytv.1_exon_0_0_chr21_5011799_f 0   +
chr21   5012547 5012687 uc061ytv.1_exon_1_0_chr21_5012548_f 0   +
chr21   5014385 5014471 uc061ytv.1_exon_2_0_chr21_5014386_f 0   +
chr21   5016934 5017145 uc061ytv.1_exon_3_0_chr21_5016935_f 0   +
chr21   5022492 5022693 uc032pql.2_exon_0_0_chr21_5022493_f 0   +
chr21   5025008 5025049 uc032pql.2_exon_1_0_chr21_5025009_f 0   +
chr21   5026279 5026630 uc032pql.2_exon_2_0_chr21_5026280_f 0   +
...

Because the first part of the exon id is the gene id, I could run regex-based search and replace on the first underscore. After much regex debugging because I'm not used to awk, I came up with

Replace text in a specific column Column 4 Find pattern [A-Za-z0-9]+.[0-9]+ Replace pattern &\t&

which gives:

    chr21   5011798 5011874 uc061ytv.1  uc061ytv.1_exon_0_0_chr21_5011799_f 0   +
    chr21   5012547 5012687 uc061ytv.1  uc061ytv.1_exon_1_0_chr21_5012548_f 0   +
    chr21   5014385 5014471 uc061ytv.1  uc061ytv.1_exon_2_0_chr21_5014386_f 0   +
    chr21   5016934 5017145 uc061ytv.1  uc061ytv.1_exon_3_0_chr21_5016935_f 0   +
...

I'm not crazy about this based on it relying on a specific identifier format convention (which I guessed - documented where?) and having to teach regexes in a short class.

The interval join option

The alternative would be to take the gene/mRNA list and do an interval join based on coordinate overlaps. This has a couple of drawbacks that I can think of

Slow (but once done the dataset can be reused)
Overlapping genes (these exist in bacteriophage; not sure about humans)

Using this tutorial pointed out by Dave Clements, I can even do this using other kinds of identifiers, like UniProt accessions instead of the UCSC ids.

Other

I'm hoping that readers here can fill in this with something that is more generalizable that takes advantage of the existing parent-child feature relationship between exons and genes.

snp galaxy101 ucsc genes galaxy • 465 views

ADD COMMENT • link •

modified 9 months ago • written 10 months ago by jimhu • 20

10 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The 101 tutorial's goal is to give a general overview of functions. To go further, such as grouping by gene symbol, more steps are needed if this is your starting place. These extra tools/steps can be bundled into a workflow, but as you state, the details for parsing the data will vary by the source of the data used. This is due to the content that different data providers include in the tracks when exported (to anywhere, not just Galaxy).

In short, one can only start work with the data as modeled and available from the data source. Then it can be manipulated.

Most BED6 datasets do not contain gene symbols unless specifically manipulated to contain that value in the "name" columns of the data (column 4). Gene what is included in the gene_id attributes of a GTF dataset. For GTF data extracted from the UCSC table browser, the gene_id and transcript_id from any track will have the same value (the transcript_id) for both values. This means that when using this data/source, all summaries using a GTF input will really be only by summarized by transcript_id if parsed from a BED6 dataset (example: uc061ytv.1). This result (transcript instead of gene summary) occurs across tools - whether used in Galaxy or no - if parsed using the transcript/transcript_id value. Other data sources of GTF data will usually have gene_id and transcript_id as distinct values (iGenomes is one example).

To summarize:

If you want to group/count by transcript_id (example: uc061ytv.1), parsing the data as you have done is a correct method.
Some tracks at UCSC include the gene name as a distinct value in the primary table (the full table from UCSC). Example: RefSeq Genes has this value in the "name2" column of the full table. In the UCSC Table browser, click on "view schema" for any table to review the contents. If you used this track instead for the 101 steps (instead of UCSC genes), and extracted the name and name2 values (only) from the primary table as a tabular dataset into Galaxy, the parsed transcripts from the BED6 exons table could be joined with the gene-transcript mapping output and then summarized (grouped) by gene. This might be a good solution for your case.
Other tracks at UCSC do not include the gene name in the primary table, but it is linked in other tables. This linked data can be exported in tabular format (export all or some selected fields from the primary and related tables). Example: UCSC Genes has a linked table named "xref" that contains gene symbols. The process for mapping gene-transcript and summarizing by gene would be the same after that. This could also be a choice, as it would teach how to do more advanced queries and data manipulations.

If you wanted to contribute to the GTN Galaxy tutorials to expand the scope or address new analysis questions, that would be more than welcome. If you find yourself needing to do a step repeatedly that is a bundle of many functions, a shared/published workflow is usually the best choice. If there functions that do not exist that would be useful for others too, you could also make a request to the IUC about creating a new tool or tools.

Hopefully, that helps! Jen, Galaxy team

ADD COMMENT • link written 10 months ago by Jennifer Hillman Jackson ♦ 25k

9 months ago by

jimhu • 20

jimhu • 20 wrote:

Thanks, it sounds like the problem is beyond the scope of the project and the kludges I came up with are good enough for my short term purposes. Am I allowed to close my issues or do you have to do it for me?

ADD COMMENT • link written 9 months ago by jimhu • 20

You can accept my answer or just leave the post as-is, either is Ok. Thanks!

ADD REPLY • link written 9 months ago by Jennifer Hillman Jackson ♦ 25k

Please log in to add an answer.

GOAL: identify exons by what gene they come from

The regex option

The interval join option

Other

Similar posts • Search »