Question: Just getting started with Galaxy 101 Tutorial
Bob0 wrote:

Hello, I'm just getting started with the Galaxy 101 tutorial and I wanted to report what might be a few problems. I'm not sure if this is the place to make a report (or get help), since the tutorial is hosted at https://github.com/nekrut/galaxy/wiki/Galaxy101-1. I am running my own local copy of Galaxy that I recently downloaded.

Disclaimer: I'm a software developer / engineer with very little experience in biology.

  1. The instructions for the 101 tutorial seem to be a bit outdated. They reference data sets that aren't available in the location mentioned (although a little poking around revealed them in a nearby archive). They also refer to menu items that have been moved/renamed in the interface.

  2. It might also be good to provide an estimate of the processing time and resulting database size for the join operation. With the data I used (which was a bit of a guess ... see note 1 above), the join operation appeared to hang. I restarted it a few times, with the same result. Eventually I decided to let it run overnight, and it did finish, but it generated a large data set that alarmed my system administrator. If this is expected, it might be good to mention it in the tutorial. If this isn't expected, then I may be doing something wrong and welcome any help available.

If anyone can update the tutorial (or direct me to a more updated version) that would be very welcome!!

Thanks.

modified 14 months ago • written 14 months ago by Bob0

Super. I'll try again as soon as my system administrator finishes migrating me to a new scratch location.

My unexpected use of 700GB took him by surprise, and he wants to isolate my future experiments. : )

modified 14 months ago • written 14 months ago by Bob0
Jennifer Hillman Jackson23k wrote:

Hello,

The current version is not very old, and I can explain what I think is going on and set some expectations.

For the specific long-running Join job with large output: My guess is that at the step where the data is extracted from UCSC, the entire genome was selected, whereas the tutorial instructs you to filter by a single chromosome. This is a common mistake for those new to the UCSC Table Browser, and the tutorial has specific instructions for this filtering, along with graphics. Maybe try again, making sure that this step is done with the filter?
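
One way to check whether the filter was applied is to count the records per chromosome in the extracted dataset. The following is only a minimal sketch: it assumes the dataset is a tab-separated BED file with the chromosome name in the first column, and the function name and sample records are made up for illustration.

```python
from collections import Counter

def chromosome_counts(bed_lines):
    """Count records per chromosome (column 1) in BED-formatted lines."""
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        counts[line.split("\t")[0]] += 1
    return counts

# Hypothetical records, for illustration only.
example = [
    "chr22\t100\t200\texonA",
    "chr22\t300\t400\texonB",
    "chr1\t500\t600\texonC",
]
counts = chromosome_counts(example)
print(dict(counts))
if set(counts) != {"chr22"}:
    print("More than chr22 is present; the region filter was probably not applied.")
```

To run this on a real dataset, pass an open file handle instead of the list, since iterating over a file yields its lines.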

For the database queries: The specific database versions noted in the tutorial change anywhere from every night (RefSeq) to every few months (dbSNP). Using the latest version of each is just fine. The goal is to learn the steps and interface - the actual results won't match the tutorial exactly, and that is expected.

For job run time estimates: How long the query will take depends on where the job is run (public server, local, cloud instance). The resources available and other concurrent tasks make a big difference - just as they do for command-line jobs - so estimating is not practical. However, I can give you a general time estimate to complete the whole thing (also noted in the tutorial): the entire tutorial might take anywhere from 45 minutes up to 2 hours for new users on a reasonably robust local or cloud Galaxy (at least 16 GB RAM), or on the public Main Galaxy server at http://usegalaxy.org (as long as these are the only queries running from the account).

General advice for running jobs in Galaxy: It is almost never a good idea to stop and restart jobs. Restarted jobs are added to the back of the queue, which always increases wait time. Deleted jobs take resources to clear, which can cause delays on the database side, especially if there are several. The best advice is to start jobs and allow them to complete, unless you know that the job had an entry error and needs to be redone (not simply an exact "re-run"). More: https://wiki.galaxyproject.org/Support#Dataset_status_and_how_jobs_execute

Thanks for the feedback, Jen, Galaxy team

modified 14 months ago • written 14 months ago by Jennifer Hillman Jackson23k
Mo Heydarian790 wrote:

Hi Bob,

It seems like your local instance of Galaxy does not have the Join tool you need. To join features based on genomic position, you need to use the "Join the INTERVALS of two datasets side-by-side" tool. For reference, you can go to useGalaxy.org and see the version of Join you need (Operate on Genomic Intervals -> Join the INTERVALS of two datasets side-by-side).

You can add this tool to your local Galaxy by giving yourself admin privileges and installing the tool from the toolshed onto your local instance. Instructions on how to do so can be found here (https://wiki.galaxyproject.org/Admin/GetGalaxy) under the 'Become an Admin' header.

Hope this helps and feel free to respond with comments and questions.

Thanks for using Galaxy!

Cheers, Mo Heydarian

written 14 months ago by Mo Heydarian790
Bob0 wrote:

The example shows:

track: GENCODE v22

But I can't find that particular track in the data I've downloaded.

It has (among many other choices):

GENCODE v24
All GENCODE v24
All GENCODE v23
All GENCODE v22

I don't recall, but I may have used "All GENCODE v22" in my previous attempt. Could that account for my 700GB data set? Should I choose something else?

The next difference may just be a user interface change. The example says:

This is done using Operate on Genomics Intervals -> Join tool

But my version of Galaxy doesn't show that option. It does, however, have a "Join, Subtract and Group" option containing a "Join two Datasets" option. That's what I had used before.

When I click "Execute" for the join (after putting them in the correct order - Exons first, SNPs second), the History shows the following in yellow:

3: Join two Datasets on data 2 and data 1

Then it just sits there with the spinning icon while it does the join. I also get a bunch of messages:

GET /api/histories/0a248a1f62a0cc04/contents?v=dev&q=update_time-ge&qv=2016-10-01T05%3A15%3A13.000Z HTTP/1.1" 200 - "http://127.0.0.1:8080/"

I've tried refreshing the history, but it just sits there spinning.

How long should that step take?

Also, when I check what I assume is the output folder (galaxy/database/files/000) I'm already seeing a dataset_19.dat file that's at 50 gigabytes and growing steadily. It looks well on its way toward another 700GB result. The dataset_17.dat file is 933KB and the dataset_18.dat file is 17MB.

Here's the "Job Command-Line":

python /scratch/Galaxy/galaxy/tools/filters/join.py /scratch/Galaxy/galaxy/database/files/000/dataset_17.dat /scratch/Galaxy/galaxy/database/files/000/dataset_18.dat 1 1 /scratch/Galaxy/galaxy/database/files/000/dataset_19.dat --index_depth=3 --buffer=50000000 --fill_options_file=/scratch/Galaxy/galaxy/database/jobs_directory/000/19/tmpYv55_l

Eventually, I had to end that job to keep from filling my scratch drive.

Any ideas on what I might be doing wrong?

modified 14 months ago • written 14 months ago by Bob0
Bob0 wrote:

The tutorial says:

"Make sure that your settings are exactly the same as shown on the screen"

The screen shot shows:

track: GENCODE v22

But the choices available are:

  • GENCODE v24
  • RefSeq Genes
  • All GENCODE V24
  • All GENCODE V23
  • All GENCODE V22
  • GENCODE V20 (Ensembl 76)
  • RetroGenes V9
  • Augustus
  • CCDS
  • Geneid Genes
  • Genscan Genes
  • IKMC Genes Mapped
  • lincRNA RNA-Seq
  • lincRNA TUCP
  • LRG Transcripts
  • MGC Genes
  • Old UCSC Genes
  • ORFeome Clones
  • Other RefSeq
  • Pfam in UCSC Gene
  • SGP Genes
  • SIB Genes
  • sno/miRNA
  • tRNA Genes
  • UCSC Alt Events
  • UniProt

Which one is correct for this tutorial?

written 14 months ago by Bob0

GENCODE v24 is the correct choice as a replacement for GENCODE v22 (24 is just the latest version).

However - I am wondering if you are really using the same tutorial: https://github.com/nekrut/galaxy/wiki/Galaxy101-1

The most current version of the tutorial instructs you to use UCSC Known Genes as the gene track, and to limit it to a single chromosome, "chr22" - that part is very important. This is set under "regions" on the UCSC Table Browser form, as shown in the graphic.

Also - the two join tools are very different. One joins on a common key (value in a column). The other joins by looking for genomic footprint overlap. You want to use the second - these tools are in the Tool Shed. Go to http://usegalaxy.org/toolshed and search for "GOPS" to locate the repo for review: suite_gops_1_0. Now go in through the Admin functions of your Galaxy instance and install this for use.
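
To make the difference between the two kinds of join concrete, here is a rough sketch in Python. It is not the code either Galaxy tool runs: the function names and sample records are hypothetical, the overlap test assumes BED-style half-open coordinates, and the nested loops are the simplest possible approach rather than an efficient one.

```python
def join_on_key(rows_a, rows_b, col_a=0, col_b=0):
    """Pair every row of A with every row of B that has the same key value."""
    return [a + b for a in rows_a for b in rows_b if a[col_a] == b[col_b]]

def join_on_overlap(rows_a, rows_b):
    """Pair rows whose (chrom, start, end) intervals overlap on the same chromosome."""
    return [
        a + b
        for a in rows_a
        for b in rows_b
        if a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
    ]

# Hypothetical records: (chrom, start, end, name).
exons = [("chr22", 100, 200, "exon1"), ("chr22", 500, 600, "exon2")]
snps = [
    ("chr22", 150, 151, "snpA"),
    ("chr22", 550, 551, "snpB"),
    ("chr22", 900, 901, "snpC"),
]

print(len(join_on_key(exons, snps)))  # 6
for row in join_on_overlap(exons, snps):
    print(row[3], row[7])
```

With the key join on column 1, every exon pairs with every SNP because they all share "chr22", so the output grows with the product of the two dataset sizes. The overlap join only keeps the pairs where a SNP actually falls inside an exon.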

written 14 months ago by Jennifer Hillman Jackson23k
Bob0 wrote:

Thanks for your response. I too have been wondering if we're looking at the same tutorial. For example, it doesn't say "UCSC Known Genes" in the version I've been using. I've reproduced what I see at: https://github.com/nekrut/galaxy/wiki/Galaxy101-1 starting with step 1. below:


1. Getting data from UCSC

1.0. Getting coding exons

First thing we will do is to obtain data from UCSC by clicking Get Data -> UCSC Main:

You will see UCSC Table Browser interface appearing in your browser window:

enter image description here

Make sure that your settings are exactly the same as shown on the screen (in particular, position should be set to "chr22", output format should be set to "BED - browser extensible data", and "Galaxy" should be checked within the Send output to option). Click get output and you will see the next screen:

enter image description here

here make sure Create one BED record per: is set to "Coding Exons" and click Send Query to Galaxy button. After this you will see your first History Item in Galaxy's right pane. It will go through gray (preparing) and yellow (running) states to become green:

enter image description here

1.1. Getting SNPs

Now is the time to obtain SNP data. This is done almost exactly the same way. First thing we will do is to again click on Get Data -> UCSC Main:

enter image description here

but now change group to "Variation":

enter image description here

so that the whole page looks like this:

enter image description here

click get output and you should see this:

enter image description here

where you need to make sure that Whole Gene is selected ("Whole Gene" here really means "Whole Feature") and click Send Query to Galaxy button. You will get your second item in the history:

enter image description here

Now we will rename the two history items to "Exons" and "SNPs" by clicking on the Pencil icon adjacent to each item. After changing the name scroll down and click Save. Also we will rename history to "Galaxy 101 (2015)" (or whatever you want) by clicking on Unnamed history so everything looks like this:

enter image description here

2. Finding Exons with the highest number of SNPs

2.0. Joining exons with SNPs

Let's remind ourselves that our objective was to find which exon contains the most SNPs. This first step in answering this question will be joining exons with SNPs (a fancy word for printing exons and SNPs that overlap side by side). This is done using Operate on Genomics Intervals -> Join tool:

enter image description here

make sure your Exons are first and SNPs are second and click Execute. You will get the third history item:

enter image description here


That's when my version hangs.

modified 14 months ago • written 14 months ago by Bob0

I see now. Some of the graphics have GENCODE, some have Known Genes. Either is OK to use (as are most tracks in the group). Just be sure to limit the region by chromosome and use the correct Join tool - see my prior post. Jen

written 14 months ago by Jennifer Hillman Jackson23k

In my version, the left panel shows:

  • Get Data
  • Send Data
  • Collection Operations
  • Text Manipulation
  • Filter and Sort
  • Join, Subtract and Group
  • Convert Formats
  • Extract Features
  • Fetch Sequences
  • Fetch Alignments
  • Statistics
  • Graph/Display Data
  • Workflows
  • All workflows

The "Join, Subtract and Group" option has suboptions:

  • Join two Datasets side by side on a specified field
  • Compare two Datasets to find common or distinct rows
  • Group data by a column and perform aggregate operation on other columns.

Is the first of those suboptions the one to use? And (see below) do I have to do anything with the additional fields (columns) that aren't shown in the earlier version?

written 14 months ago by Bob0
Bob0 wrote:

Note that the "Join" page above is different from the one in the version of Galaxy that I have. Mine shows:

Join two Datasets side by side on a specified field (Galaxy Version 2.0.2)

Join
  1. Exons
using column
  Column: 1
with
  2: SNPs
and column
  Column: 1
Keep lines of first input that do not join with second input
  No
Keep lines of first input that are incomplete
  No
Fill empty columns
  No
Execute

Could that make a difference?

modified 14 months ago • written 14 months ago by Bob0

Yes, there are two Join tools. You want the one that compares overlapping coordinates, not common columns. See my prior post about the correct tool repo to install into your local instance. - Jen
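
As a rough illustration of why the common-column join can produce so much output, the number of rows it writes can be estimated from how often each key value appears in the two inputs. This is a sketch only: the function name and the row counts below are hypothetical, not measurements from the datasets in this thread.

```python
from collections import Counter

def key_join_row_count(keys_a, keys_b):
    """Estimate output rows of a key join: sum over keys of count_a * count_b."""
    counts_a = Counter(keys_a)
    counts_b = Counter(keys_b)
    # A Counter returns 0 for keys it has not seen, so unmatched keys add nothing.
    return sum(n * counts_b[key] for key, n in counts_a.items())

# Hypothetical sizes: every row in both inputs has "chr22" in column 1.
keys_a = ["chr22"] * 1_000
keys_b = ["chr22"] * 20_000
print(key_join_row_count(keys_a, keys_b))  # 20000000
```

When column 1 holds the chromosome name and both inputs cover the same chromosome, every row matches every other row, so even modest inputs can produce an output many times larger than either one.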

written 14 months ago by Jennifer Hillman Jackson23k

So the option to:

"Join two Datasets side by side on a specified field"

is NOT the one to use?

I didn't see any instructions about installing different tools.

"Galaxy 101" might have been a mild misnomer.


I'm adding to this post since the anti-spam feature is limiting me to 5 posts in 6 hours.

I've installed the "suite_gops_1_0" tool, but I don't know how to find the new "Join" tool. The tutorial showed the "Join" under the "Operate on Genomic Intervals" menu:

enter image description here

But I'm not seeing it on my version (2.0.2). Do I need to restart?

modified 14 months ago • written 14 months ago by Bob0
Bob0 wrote:

Thanks Mo.

For anyone following along, I found the newly added "Join" function under the "Get Data" menu (on the left side). With that tool, I was able to complete the Galaxy 101-1 tutorial.

With the best of intentions, let me observe that the tutorial (especially with a "101" name) could use some updating. Just adding a few of the comments from this topic might save new users a few hours of frustration.

Thanks for the help ... now on to Galaxy 101-2.


Galaxy 101-2 was uneventful. Everything worked as per the tutorial. Thanks.

written 14 months ago by Bob0