I used Galaxy to extract the 1000 bp upstream of all UCSC genes (i.e. promoters). I sorted the data by chromosome and then by start position (i.e. by c1, then by c2). For any gene with multiple isoforms that share the same start site, there will be lines with duplicate chromosome and start coordinates, and I want to remove these.
Essentially, column 2 contains the start coordinate. I want to remove all lines with a duplicate start coordinate (for a given chromosome).
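To make the goal concrete, here is a minimal Python sketch of the filtering I have in mind. It assumes the file is tab-separated with the chromosome in column 1 and the start in column 2, and that keeping the first line for each chromosome/start pair is acceptable (the post does not say which isoform should survive). The function name and the sample gene names are made up for illustration.

```python
def drop_duplicate_starts(lines):
    """Yield the first line seen for each (chromosome, start) pair.

    Assumes tab-separated columns: c1 = chromosome, c2 = start.
    Lines with fewer than two columns are passed through unchanged.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            yield line
            continue
        key = (fields[0], fields[1])  # chromosome + start coordinate
        if key not in seen:
            seen.add(key)
            yield line


# Hypothetical sample: two isoforms of one gene share a start on chr1.
sample = [
    "chr1\t1000\t2000\tgeneA.1\n",
    "chr1\t1000\t2000\tgeneA.2\n",
    "chr1\t5000\t6000\tgeneB\n",
    "chr2\t1000\t2000\tgeneC\n",
]
unique = list(drop_duplicate_starts(sample))
```

Because the key includes the chromosome, the chr2 line with start 1000 is kept even though chr1 also has a start of 1000. The set-based check does not depend on the file being sorted, so it should work on the sorted output as well as on the raw export.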
Thank you in advance for your wonderful help to a student who is still learning the computational basics.