Question: Merging Intervals Within Genes?
Denis BAURAIN • 10 wrote:
Hello,
I use the remote version of Galaxy to extract aligned blocks of 3'-UTRs.
For performance reasons, I would like to merge the 3'-UTR exons that
overlap before fetching MAF blocks. The 'merge overlapping intervals'
option is nearly what I need, except that it discards any id information
attached to the exons (see below).
Therefore, I wonder if it would be difficult to implement a 'gene-aware'
version of the merge operation. In particular, it should not try to
merge overlapping exons from overlapping genes (e.g., one on each strand).
I have my own implementation of this in Perl, but it is a bit tedious to
export and re-import interval files just to perform this 'compression'.
# input:
chr10 100133312 100133544 - ENSG00000119943
chr10 100165944 100167310 - ENSG00000107521
chr10 100165945 100167310 - ENSG00000107521
chr10 100166796 100167310 - ENSG00000107521
chr10 100208864 100209320 - ENSG00000172987
chr10 100208866 100209320 - ENSG00000172987
chr10 100209320 100209486 - ENSG00000172987
chr10 100211496 100211532 - ENSG00000172987
...
chr10 12235787 12235825 + ENSG00000065665
chr10 12237333 12237409 + ENSG00000065665
chr10 12246459 12246832 + ENSG00000065665
chr10 12246459 12247368 + ENSG00000065665
chr10 12251332 12251962 + ENSG00000065665
chr10 12248507 12248728 - ENSG00000165609
chr10 12249580 12249706 - ENSG00000165609
chr10 12249581 12249706 - ENSG00000165609
chr10 12251726 12252150 - ENSG00000165609
chr10 12252236 12252857 - ENSG00000165609
...
# current output:
chr10 100133312 100133544
chr10 100165944 100167310
chr10 100208864 100209486
chr10 100211496 100211532
...
chr10 12235787 12235825
chr10 12237333 12237409
chr10 12246459 12247368
chr10 12248507 12248728
chr10 12249580 12249706
chr10 12251332 12252150
chr10 12252236 12252857
...
# desired output:
chr10 100133312 100133544 - ENSG00000119943
chr10 100165944 100167310 - ENSG00000107521
chr10 100208864 100209486 - ENSG00000172987
chr10 100211496 100211532 - ENSG00000172987
...
chr10 12235787 12235825 + ENSG00000065665
chr10 12237333 12237409 + ENSG00000065665
chr10 12246459 12247368 + ENSG00000065665
chr10 12251332 12251962 + ENSG00000065665
chr10 12248507 12248728 - ENSG00000165609
chr10 12249580 12249706 - ENSG00000165609
chr10 12251726 12252150 - ENSG00000165609
chr10 12252236 12252857 - ENSG00000165609
...
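The gene-aware merge described above can be sketched as follows. This is
a minimal Python sketch under stated assumptions, not the Perl
implementation mentioned above and not an existing Galaxy tool: the
function name merge_by_gene is hypothetical, rows are assumed to be
(chrom, start, end, strand, gene_id) tuples, and abutting intervals
(end equal to the next start) are merged, as the desired output for
ENSG00000172987 suggests.

```python
from collections import defaultdict


def merge_by_gene(rows):
    """Merge overlapping or abutting intervals separately for each gene.

    rows: iterable of (chrom, start, end, strand, gene_id) tuples.
    Returns merged tuples in the same five-column layout.
    """
    # Group by (chrom, strand, gene_id) so that exons from different
    # genes are never merged, even when their coordinates overlap.
    # Dicts keep insertion order, so genes are emitted in the order in
    # which they first appear in the input.
    groups = defaultdict(list)
    for chrom, start, end, strand, gene_id in rows:
        groups[(chrom, strand, gene_id)].append((int(start), int(end)))

    merged = []
    for (chrom, strand, gene_id), intervals in groups.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or abutting: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, strand, gene_id))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand, gene_id))
    return merged


# Two overlapping genes on opposite strands, taken from the input above.
example = [
    ("chr10", 12246459, 12246832, "+", "ENSG00000065665"),
    ("chr10", 12246459, 12247368, "+", "ENSG00000065665"),
    ("chr10", 12251332, 12251962, "+", "ENSG00000065665"),
    ("chr10", 12249580, 12249706, "-", "ENSG00000165609"),
    ("chr10", 12249581, 12249706, "-", "ENSG00000165609"),
    ("chr10", 12251726, 12252150, "-", "ENSG00000165609"),
]
for row in merge_by_gene(example):
    print(*row)
# chr10 12246459 12247368 + ENSG00000065665
# chr10 12251332 12251962 + ENSG00000065665
# chr10 12249580 12249706 - ENSG00000165609
# chr10 12251726 12252150 - ENSG00000165609
```

Note that 12251332-12251962 (+) and 12251726-12252150 (-) stay separate
here, whereas the current output fuses them into 12251332-12252150.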
Best regards,
Denis BAURAIN