How do I annotate a bed file with latest gencode

Question: How do I annotate a bed file with latest gencode

5 months ago by

Alex.v.nesta • 0

Alex.v.nesta • 0 wrote:

I'm pretty new to this, so go easy on me.

I'm trying to map some PacBio full isoform RNA seq data back to hg38 and then annotate it.

So far, I have used gmap, generated the hg38 map from fasta files, mapped the raw reads back to hg38.

Then I converted my output sam file to a bam file and then to a bed file with bedtools bamtobed.

I have a bed file containing chromosome, start, end, strand, and a proprietary identifier for each isoform.

I want to annotate this bed file with the latest gencode.

Specifically, I need this bed file to look like this:

chr1    PacBio  transcript  27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  exon    27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  transcript  482637  484073  .   -   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  exon    482637  484073  .   -   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  transcript  725905  728472  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    725905  727143  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    727486  728472  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";

Of specific importance is the column identifying transcript or exon.

I think annotating with the latest gencode will do just that, but it's been a pain in the @$$.

Please, let me know if you can help me out.

My Current Bed File:

chr1    184917  199869  PB2015.9302.2|chr2:114341681-114356616(+)|i1c_c18438/f2p41/1737 40  -
chr1    184918  199872  PB2015.5666.1|chr15:102501809-102516767(+)|newClontech_i0HQ|c18452/f1p15/1961   40  -
chr1    184927  186058  PB2015.5667.1|chr15:102515628-102516758(+)|i1a_c70772/f4p21/1137    3   -
chr1    198086  199125  PB2015.3240.2|chr12:75077-76115(+)|i1c_c52606/f1p18/1048    1   -

Commands I have tried:

bedtools intersect -wb -a 2015MCF7cleaned.bed -b gencode.v28.annotation.gtf | awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$9,$18,$20}'> output.bed

bedtools annotate -i 2015MCF7cleaned.bed -files gencode.v28.annotation.gtf >test.gtf

The first command gets me pretty close, but it is not in proper bed or gtf format, so I can't load it up in IGV. I'm about to do a bunch of manual manipulation in R, but there has to be a better way!

The second command, I think, is the better way, but for some reason it does not work and returns the original file. ARG!

I'm open to any suggestions, including throwing my computer at the wall until it magically solves my issues.

Thanks, Alex

bedtools • 467 views

ADD COMMENT • link •

modified 5 months ago by Jennifer Hillman Jackson ♦ 25k • written 5 months ago by Alex.v.nesta • 0

5 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Those tools will not convert a BED file to a GTF file (within Galaxy or line command).

You need to assemble the reads and include reference annotation to link in the data and produce a GTF result. Please see the Galaxy RNA-seq tutorials to better understand how to process/annotate your data: https://galaxyproject.org/learn/

That said, you could also join the two datasets by overlapping coordinates, but the results will be messy (many, many duplications) will not be in GTF format -- so I wouldn't recommend it.

Thanks, Jen, Galaxy team

ADD COMMENT • link written 5 months ago by Jennifer Hillman Jackson ♦ 25k

Similar posts • Search »