GOAL: identify exons by what gene they come from
Posting this after being cajoled by @galaxyproject to ask support questions here and not just on twitter...
I'm teaching a short module on Big Data for first-year students where I include a couple of sessions on Galaxy. I started with the Galaxy 101 tutorial of finding the exons on a human chromosome with the most SNPs. I wanted to extend the lesson in a couple of ways and build a homework assignment based on asking which genes on a particular chromosome had most SNPs that fall into exons. But then I realized I don't know how to do that myself. Here's what I came up with
The regex option
The exons table resulst looks like this
chr21 5011798 5011874 uc061ytv.1_exon_0_0_chr21_5011799_f 0 +
chr21 5012547 5012687 uc061ytv.1_exon_1_0_chr21_5012548_f 0 +
chr21 5014385 5014471 uc061ytv.1_exon_2_0_chr21_5014386_f 0 +
chr21 5016934 5017145 uc061ytv.1_exon_3_0_chr21_5016935_f 0 +
chr21 5022492 5022693 uc032pql.2_exon_0_0_chr21_5022493_f 0 +
chr21 5025008 5025049 uc032pql.2_exon_1_0_chr21_5025009_f 0 +
chr21 5026279 5026630 uc032pql.2_exon_2_0_chr21_5026280_f 0 +
...
Because the first part of the exon id is the gene id, I could run regex-based search and replace on the first underscore. After much regex debugging because I'm not used to awk, I came up with
Replace text in a specific column Column 4 Find pattern [A-Za-z0-9]+.[0-9]+ Replace pattern &\t&
which gives:
chr21 5011798 5011874 uc061ytv.1 uc061ytv.1_exon_0_0_chr21_5011799_f 0 +
chr21 5012547 5012687 uc061ytv.1 uc061ytv.1_exon_1_0_chr21_5012548_f 0 +
chr21 5014385 5014471 uc061ytv.1 uc061ytv.1_exon_2_0_chr21_5014386_f 0 +
chr21 5016934 5017145 uc061ytv.1 uc061ytv.1_exon_3_0_chr21_5016935_f 0 +
...
I'm not crazy about this based on it relying on a specific identifier format convention (which I guessed - documented where?) and having to teach regexes in a short class.
The interval join option
The alternative would be to take the gene/mRNA list and do an interval join based on coordinate overlaps. This has a couple of drawbacks that I can think of
- Slow (but once done the dataset can be reused)
- Overlapping genes (these exist in bacteriophage; not sure about humans)
Using this tutorial pointed out by Dave Clements, I can even do this using other kinds of identifiers, like UniProt accessions instead of the UCSC ids.
Other
I'm hoping that readers here can fill in this with something that is more generalizable that takes advantage of the existing parent-child feature relationship between exons and genes.