Question: How to find CpG sites on human chromosome 17 using Galaxy
2.7 years ago, chanwoo1143 wrote:

I would like to know how to find CpG sites on chromosome 17 in human (Homo sapiens) using Galaxy. I have tried the UCSC database, but did not get anything out of it. These are the CpG sites I need to find (cg02228185 in ASPA, cg25809905 in ITGA2B, and cg17861230 in PDE4C), and I know their source sequences. If anyone could help me find these CpG sites, it would literally help me survive college.

mapping cpg islands • 1.5k views
2.7 years ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

This is the target publication (age related CpG sites?): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4082572/

How was the data in the publication not helpful?

If you have the sequences, BLAT at UCSC can map these to the genomic locations for you. These output coordinates can be loaded into Galaxy, along with the CpG track, then the two intersected to find common overlap. You could also extract a gene track and compare those coordinates with the BLAT mapped sites, if that is what you are looking for. Even cross-genome mapping should be productive in many cases using BLAT, if that is the goal (discovery of these regions in other species).

Various online data providers also likely have these sites exactly mapped (beyond what sequence analysis can do as far as prediction).

Please explain your issue in more detail if this does not produce what you need. We really shouldn't answer exact classroom assignment questions, but can offer analysis guidance.

Thanks, Jen, Galaxy team

2.7 years ago, chanwoo1143 wrote:

Hello,

I really appreciate your answer. My question may seem irresponsible, but I am a 3rd-year student in the Netherlands, and I only found this Galaxy website through Coursera. I still do not know much about how to work with these tools, as I have not had a proper bioinformatics course. This is for my project, which carries 18 credits. I have tried hard with many different methods and asked many teachers about this, but they did not know what to do. It may sound like a silly excuse, but I am trying hard not only to work on my project, but also to do other things. There are still so many things I do not know about this software. I am getting desperate, as I could fail my project. Could you tell me how to find the overlaps? I do not even know how to load the output coordinates. I know I am asking a lot.

Regards,

Chanwoo

2.7 years ago, Bjoern Gruening (Germany) wrote:

Hi,

we have a tool that does exactly that for you, at least if I understood it correctly. https://github.com/bgruening/galaxytools/commits/master/tools/find_subsequences

Here is the TS entry: https://toolshed.g2.bx.psu.edu/view/bgruening/find_subsequences/d882a0a75759

Jen, would this be something for Galaxy Main? Cheers, Bjoern


Hi, I am not at all familiar with GitHub. There are some programs there which I unfortunately do not have the skills to work with. Is there some other way (e.g. a Galaxy tool)?

Regards, Chanwoo

Reply written 2.7 years ago by chanwoo1143

Which Galaxy server are you using?

Reply written 2.7 years ago by Bjoern Gruening

Hi Bjoern, Tools to include on Main is a question for the PIs. Maybe ping on IRC? Thanks, Jen

Reply written 2.7 years ago by Jennifer Hillman Jackson
2.7 years ago, chanwoo1143 wrote:

Sorry, but I am not sure what you are meaning to say specifically; it almost sounds poetic...


You are logged in to a Galaxy server, right? Can you provide the URL of the Galaxy server you are using? Or have you never heard of Galaxy and maybe you are in the wrong Biostar channel? :)

Reply written 2.7 years ago by Bjoern Gruening
2.7 years ago, chanwoo1143 wrote:

I have published my work here: https://usegalaxy.org/u/chawnerd/h/unnamed-history.

Regards,

Chanwoo


Hello,

Here is more help:

Data incompatibility with input's reference genomes:

  1. The CpG track extracted from UCSC was based on hg38, whereas the Bowtie mapping was to hg19. The reference genome must be the same for intersection tools (there are several to choose from) to work. So, I extracted just the hg19 CpG regions on chr17 in the target area (specifically, region = chr17:3370546-3390588). If you extract the entire genome, the dataset is too large to transfer from UCSC to Galaxy (there is roughly a 100k line limit and this track is large).

To compare overlap between the Bowtie map coordinates and the hg19 CpG content bed file and preserve all original content (probably what you want):

  1. Convert the Bowtie BAM dataset to BED (BEDTools: Convert from BAM to BED, 6 columns is enough, the data is not spliced).
  2. Operate on Genomic Intervals. The tool Join is based on common overlapping coordinates and will preserve all original data. Or, use the Intersect function just to count up how many from the input overlap with the query (may be useful).

Result: There are no overlapping regions between the mapped Bowtie results and the UCSC CpG island track for hg19. That was my test. You could test out to see if hg38 has more updated track data and re-run the analysis (this means re-mapping with Bowtie this time using hg38 as well, plus adjusting the CpG track region retrieved so that it overlaps with those hits - again, extracting an entire chromosome or the entire genome will almost certainly result in truncated output).

CpG track content for region spanning chr17:3370546-3390588
chr17   3375006 3375237 CpG:_33

Mapped Bowtie content:
chr17   3379546 3379588 ASPA/1  0   -
chr17   3379566 3379616 ASPA2/2 0   -
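
As a sanity check outside Galaxy, the comparison above can be sketched in a few lines of Python. This is only an illustration of the overlap test, not what Galaxy's Join or Intersect tools run internally. The `Interval` type and the `overlaps` helper are hypothetical names, and the coordinates are assumed to be BED-style (0-based start, exclusive end).

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed BED convention)
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Coordinates copied from the listing above (hg19).
cpg_islands = [Interval("chr17", 3375006, 3375237, "CpG:_33")]
bowtie_hits = [
    Interval("chr17", 3379546, 3379588, "ASPA/1"),
    Interval("chr17", 3379566, 3379616, "ASPA2/2"),
]

# Pair up every mapped read with every CpG island it overlaps.
shared = [
    (hit.name, island.name)
    for hit in bowtie_hits
    for island in cpg_islands
    if overlaps(hit, island)
]
print(shared)  # prints [] because the island ends before either hit starts
```

The island ends at 3375237 and the first hit starts at 3379546, so the empty result matches the Galaxy test described above.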

Another source for CpG island content will be needed if the track data for hg38 does not produce results. Check the publication to see if it is included there, or if it references the source, or try searching other public data repositories for the data.

More is beyond the scope of this forum. https://www.biostars.org, http://seqanswers.com, and similar sites are the places to ask about general analysis and get feedback from those working in this field. Some knowledge/research about how bioinformatics works, and about exactly what your inputs and desired output (including downstream analysis plans) are, will make for a better question that is more likely to get a reply from your peers.

To learn more about how to use Galaxy and more about basic bioinformatics in general, please see the top links at this page. This will provide some background in the technology, tools, and tutorials to get started using Galaxy. https://wiki.galaxyproject.org/Support

Thanks and hopefully this helps you to move forward with your project. Bjoern's tool does this particular analysis by a more direct route, but that would involve setting up a local Galaxy of your own, installing the tools/reference data needed, and the like. This is certainly worth doing at some point (now or later) if you plan to work in the field. It will take time the first time through since it is new to you, but a basic install isn't overly technical.

Still, the above are the current options when using the public Main Galaxy instance and using it as a learning tool or for this particular simple overlap analysis. Galaxy itself can do very complex analysis, but that is what you will learn as you work with Galaxy and get some experience doing bioinformatics in general. Galaxy simplifies usage of tools, but the scientist doing the work still needs to understand the tools, analysis goals, and underlying technology.

Thanks for using Galaxy, take care, Jen, Galaxy team

Reply written 2.7 years ago (modified 2.7 years ago) by Jennifer Hillman Jackson
2.7 years ago, chanwoo1143 wrote:

Thank you for the help. These are the source sequences for each of the CpG sites.

cg25809905 36 17 39823254 NCBI:RefSeq 36.1 CCAAGAGTAAACAGTGTGCTCAATGCTGTGCCTACGTGTGTTAGCCCACG 39822399 - GeneID:3674 ITGA2B

cg02228185 36 17 3326317 NCBI:RefSeq 36.1 GGTTAGTAATAAATGGTTTTACCTCCAGCCCTGTTCTCTGAATCTCAGCG 3326046 + GeneID:443 ASPA

cg17861230 36 19 18204901 NCBI:RefSeq 36.1 GGATCCGAATAGAAGCGCTGTTGGATGCGGATGGGGCGCCGGGGTTGCCG 18205016 - GeneID:5143 PDE4C


You could map these to hg19 (Lastz is one choice; its input is FASTA format, so some manipulation will be needed. I would just use a text editor locally, then upload to Galaxy to keep it simple). After that, do the overlapping coordinate comparison. From the annotation, though, I don't think a match will be found, unless the above data is based on hg38. If it is (I suggested testing this anyway): re-map the query FASTQ to hg38, map these sequences to hg38 as well, and then compare.
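
If a text editor feels error-prone, the FASTA conversion could also be done with a short script. The following is a minimal sketch, assuming the three rows posted above are whitespace-separated with the probe ID in the first field, the source sequence in the seventh, and the gene symbol in the last. The `row_to_fasta` helper and the header layout are hypothetical choices, not a Galaxy or Lastz requirement.

```python
# Rows copied verbatim from the post above.
rows = [
    "cg25809905 36 17 39823254 NCBI:RefSeq 36.1 CCAAGAGTAAACAGTGTGCTCAATGCTGTGCCTACGTGTGTTAGCCCACG 39822399 - GeneID:3674 ITGA2B",
    "cg02228185 36 17 3326317 NCBI:RefSeq 36.1 GGTTAGTAATAAATGGTTTTACCTCCAGCCCTGTTCTCTGAATCTCAGCG 3326046 + GeneID:443 ASPA",
    "cg17861230 36 19 18204901 NCBI:RefSeq 36.1 GGATCCGAATAGAAGCGCTGTTGGATGCGGATGGGGCGCCGGGGTTGCCG 18205016 - GeneID:5143 PDE4C",
]


def row_to_fasta(row: str) -> str:
    """Turn one whitespace-separated row into a two-line FASTA record."""
    fields = row.split()
    # Assumed layout: probe ID first, sequence seventh, gene symbol last.
    probe_id, sequence, gene = fields[0], fields[6], fields[-1]
    return f">{probe_id}_{gene}\n{sequence}\n"


fasta = "".join(row_to_fasta(row) for row in rows)
print(fasta, end="")
```

Saving the printed text to a file (for example with `python script.py > probes.fasta`, both names being placeholders) gives a dataset that can be uploaded to Galaxy as FASTA.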

It looks like you are understanding how this works now. Best wishes for your project. Jen

Reply written 2.7 years ago by Jennifer Hillman Jackson
2.7 years ago, chanwoo1143 wrote:

Hi, this is the email I got from the author 3 days ago:

Dear chanwoo,

Thank you for your email. We have provided an online calculator to predict biological age using three CpG sites. Please go to http://www.molcell.rwth-aachen.de/epigenetic-aging-signature/ for details.

Best wishes, Qiong Lin

At that link the author gave the three sequences, which I believe are the source sequences for the CpG sites used. I also have the three sequences which I mentioned above. None of them (3 from the author and 3 from the Illumina database) match the CpG island tracks for hg19 or hg38. The ones the author used are the ones from the Illumina database. In this case, what is the solution? Indeed, I am learning how to use it. :)


I suggest contacting the author again if the data and methods in the publication do not appear to match their conclusions. They are the best resource to help sort out this type of issue.

Reply written 2.7 years ago by Jennifer Hillman Jackson